Reshaping Data with tidyr
Jeroen Boeye - DataCamp
Course Description
Data in the wild can be scary: when confronted with a complicated and messy dataset you may find yourself wondering, where do I even start? The tidyr package allows you to wrangle such beasts into nice and tidy datasets. Inaccessible values stored in column names will be put into rows, JSON files will become data frames, and missing values will never go missing again. You’ll practice these techniques on a wide range of messy datasets, learning along the way how many dogs the Soviet Union sent into space and which bird is most popular in New Zealand. With the tidyr package in your tidyverse toolkit, you’ll be able to transform almost any dataset into a tidy format, which will pay off during the rest of your analysis.
1 Tidy data
You’ll be introduced to the concept of tidy data which is central to this course. In the first two lessons, you’ll jump straight into the action by separating messy character columns into tidy variables and observations ready for analysis. In the final lesson, you’ll learn how to overwrite and remove missing values.
1.1 What is tidy data?
1.1.1 Tidy data structure
You saw a number of tidy data concepts in the video. Can you remember how they relate to a rectangular data structure?
- Assign these concepts to their location in a rectangular data frame.
- Rows: Observations
- Columns: Variables
- Cells: Values
Good job! This is the very foundation of tidy data.
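To make that concrete, here is a tiny made-up tibble (not one of the course datasets) in which each row is an observation, each column a variable, and each cell a value:

```r
library(tibble)

# A tidy rectangle: rows = observations, columns = variables, cells = values
pets_df <- tribble(
  ~name,   ~species, ~weight_kg,
  "Laika", "dog",    6,
  "Felix", "cat",    4
)
pets_df
```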
1.1.2 Multiple variables per column
Being a busy person, you don’t want to spend too much time on Netflix, so you decide to crunch some numbers on TV show and movie durations before deciding what to watch.
You’ve managed to obtain a dataset named netflix_df, but its duration column has an issue. It contains strings with both a value and a unit of duration ("min" or "Season").
You’ll tidy this dataset so that each variable gets its own column.
As will always be the case in this course, the tidyr package has been pre-loaded for you.
- Inspect netflix_df by typing its name directly in the R console at the bottom-right and hitting Enter to see what string separates the value from the unit in the duration column.
- Separate the duration column over two variables named value and unit. Pass the string separating the number from the unit to the sep argument.
netflix_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/netflix_duration.rds")

netflix_df %>%
  # Split the duration column into value and unit columns
  separate(duration, into = c("value", "unit"), sep = " ", convert = TRUE)
## # A tibble: 637 × 4
## title type value unit
## <chr> <chr> <int> <chr>
## 1 Article 15 Movie 125 min
## 2 Kill Me If You Dare Movie 100 min
## 3 The Spy TV Show 1 Season
## 4 The World We Make Movie 108 min
## 5 Watchman Movie 93 min
## 6 Mo Gilligan: Momentum Movie 64 min
## 7 Domino Movie 89 min
## 8 TUNA GIRL Movie 90 min
## 9 Bard of Blood TV Show 1 Season
## 10 Dragons: Rescue Riders TV Show 1 Season
## # … with 627 more rows
Good job on creating your first tidy dataset in this course! Notice the convert = TRUE option? It caused the value column to be converted from string to integer.
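To see what convert = TRUE does in isolation, here is a minimal sketch using a made-up two-row tibble (not the course data):

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(duration = c("90 min", "2 Season"))

# Without convert, both new columns remain character
demo_df %>%
  separate(duration, into = c("value", "unit"), sep = " ")

# With convert = TRUE, the value column is parsed to integer
demo_df %>%
  separate(duration, into = c("value", "unit"), sep = " ", convert = TRUE)
```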
1.2 Columns with multiple values
1.2.1 International phone numbers
You work for a multinational company that uses auto-dialer software to
contact its customers. When new customers subscribe online they are
asked for a phone number but they often forget to add the country code
needed for international calls. You were asked to fix this issue in the
database.
You’ve been given a data frame with national numbers and country codes named phone_nr_df. Now you want to combine the country_code and national_number columns to create valid international numbers.
- Use the unite() function to create a new international_number column, using an empty string as the separator.
phone_nr_df <- tribble(~country, ~country_code, ~national_number,
                       "USA", "+1", "2025550117",
                       "United Kingdom", "+44", "1632960924",
                       "Brazil", "+55", "95552452220",
                       "Australia", "+61", "1900654321",
                       "China", "+86", "13555953217",
                       "India", "+91", "8555843898")

phone_nr_df %>%
  # Unite the country_code and national_number columns
  unite("international_number", country_code, national_number, sep = "")
## # A tibble: 6 × 2
## country international_number
## <chr> <chr>
## 1 USA +12025550117
## 2 United Kingdom +441632960924
## 3 Brazil +5595552452220
## 4 Australia +611900654321
## 5 China +8613555953217
## 6 India +918555843898
Well done! Being able to unite columns is a good skill to have, but in practice you’ll be separating them more often while tidying data.
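Two unite() options worth knowing, shown on a small hypothetical tibble (not the course data): remove = FALSE keeps the source columns, and na.rm = TRUE skips missing pieces instead of pasting the literal text "NA" into the result.

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(code = c("+1", NA), number = c("2025550117", "1632960924"))

# remove = FALSE keeps the original columns alongside the united one
demo_df %>%
  unite("intl", code, number, sep = "", remove = FALSE)

# na.rm = TRUE skips NA pieces so the missing code doesn't become the text "NA"
demo_df %>%
  unite("intl", code, number, sep = "", na.rm = TRUE)
```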
1.2.2 Extracting observations from values
You’re given a sample of the Netflix dataset containing TV shows and their casts called tvshow_df. You want to learn which six actors have the most appearances. However, the dataset only has one row per TV show, and multiple actors are listed in the cast column.
Transform the data so that for each TV show, every actor has a row. The number of appearances will be calculated for you.
The dplyr package has been pre-loaded for you.
tvshow_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/netflix_series.rds")
tvshow_df
## # A tibble: 1,759 × 2
## title cast
## <chr> <chr>
## 1 Transformers Prime Peter Cullen, Sumalee Montano, Frank Welker…
## 2 Transformers: Robots in Disguise Will Friedle, Darren Criss, Constance Zimme…
## 3 Apaches Alberto Ammann, Eloy Azorín, Verónica Echeg…
## 4 Castle of Stars Chaiyapol Pupart, Jintanutda Lummakanon, Wo…
## 5 Archibald's Next Big Thing Tony Hale, Rosamund Pike, Jordan Fisher, Ch…
## 6 The Spy Sacha Baron Cohen, Noah Emmerich, Hadar Rat…
## 7 No Tomorrow Joshua Sasse, Tori Anderson, Jonathan Langd…
## 8 Frequency Peyton List, Riley Smith, Mekhi Phifer, Dev…
## 9 Adam Ruins Everything Adam Conover, Adam Lustick, Emily Axford
## 10 Ben 10 Tara Strong, Montse Hernandez, David Kaye, …
## # … with 1,749 more rows
Inspect tvshow_df in the console. What string separates actors in the cast column?
- " "
- ","
- ", "

- Use separate_rows() on the cast column, using the appropriate separator for the sep argument.
- Use the head() function to keep just the top six.
tvshow_df %>%
  # Separate the actors in the cast column over multiple rows
  separate_rows(cast, sep = ", ") %>%
  rename(actor = cast) %>%
  count(actor, sort = TRUE) %>%
  head()
## # A tibble: 6 × 2
## actor n
## <chr> <int>
## 1 Takahiro Sakurai 18
## 2 Yuki Kaji 16
## 3 Daisuke Ono 14
## 4 David Attenborough 14
## 5 Ashleigh Ball 12
## 6 Hiroshi Kamiya 12
Well done! Once each actor has an observation per TV show, getting the top six is easy. Interestingly, these are all voice actors or narrators.
1.2.3 Separating into columns and rows
Remember the drink ingredients data from the video? You’ve been given a similar version (drink_df) that also includes quantities and units. Now you want to create an overview of how much of each ingredient you should buy to make these drinks.
The dplyr package has been pre-loaded for you.
- Inspect drink_df in the console to find the right separator in the ingredients column.
- Separate the ingredients column so that for each drink each ingredient gets a row.
drink_df <- tribble(~drink, ~ingredients,
                    "Chocolate milk", "milk 0.3 L; chocolate 40 g; sugar 10 g",
                    "Orange juice", "oranges 3; sugar 20 g",
                    "Cappuccino", "milk 0.1 L; water 0.1 L; coffee 30 g; sugar 5 g")

drink_df %>%
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ")
## # A tibble: 9 × 2
## drink ingredients
## <chr> <chr>
## 1 Chocolate milk milk 0.3 L
## 2 Chocolate milk chocolate 40 g
## 3 Chocolate milk sugar 10 g
## 4 Orange juice oranges 3
## 5 Orange juice sugar 20 g
## 6 Cappuccino milk 0.1 L
## 7 Cappuccino water 0.1 L
## 8 Cappuccino coffee 30 g
## 9 Cappuccino sugar 5 g
- Inspect the output of the previous step to find the separator that splits the ingredients column into three columns: ingredient, quantity, and unit.
- Make sure to convert data types to numeric when possible.
drink_df %>%
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ") %>%
  # Separate ingredients into three columns
  separate(
    ingredients, into = c("ingredient", "quantity", "unit"),
    sep = " ",
    convert = TRUE
  )
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
## # A tibble: 9 × 4
## drink ingredient quantity unit
## <chr> <chr> <dbl> <chr>
## 1 Chocolate milk milk 0.3 L
## 2 Chocolate milk chocolate 40 g
## 3 Chocolate milk sugar 10 g
## 4 Orange juice oranges 3 <NA>
## 5 Orange juice sugar 20 g
## 6 Cappuccino milk 0.1 L
## 7 Cappuccino water 0.1 L
## 8 Cappuccino coffee 30 g
## 9 Cappuccino sugar 5 g
- Group the data by ingredient and unit.
- Calculate the total quantity of each ingredient.
drink_df %>%
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ") %>%
  # Separate ingredients into three columns
  separate(
    ingredients, into = c("ingredient", "quantity", "unit"),
    sep = " ",
    convert = TRUE
  ) %>%
  # Group by ingredient and unit
  group_by(ingredient, unit) %>%
  # Calculate the total quantity of each ingredient
  summarize(quantity = sum(quantity))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
## `summarise()` has grouped output by 'ingredient'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: ingredient [6]
## ingredient unit quantity
## <chr> <chr> <dbl>
## 1 chocolate g 40
## 2 coffee g 30
## 3 milk L 0.4
## 4 oranges <NA> 3
## 5 sugar g 35
## 6 water L 0.1
Good job! You separated a messy string column over both rows and columns and unlocked the data within. Note that the separate() function correctly added a missing value for the oranges unit.
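If you want to silence the "Expected 3 pieces" warning, separate() has a fill argument for exactly this case; a quick sketch on a hypothetical two-row tibble:

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(ingredients = c("milk 0.3 L", "oranges 3"))

# fill = "right" pads short strings with NA on the right, without a warning
demo_df %>%
  separate(ingredients, into = c("ingredient", "quantity", "unit"),
           sep = " ", convert = TRUE, fill = "right")
```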
1.3 Missing values
1.3.1 And the Oscar for best director goes to …
You’re working on a sample of the Netflix dataset pre-loaded as director_df. This time, the data frame contains just the directors and movie titles. Your goal is to identify the directors who created the most movies.
Since the director column contains multiple names, you’ll first separate its values over multiple rows and then count the directors.
Since you don’t want movies without directors polluting your overview, you’ll apply the drop_na() function.
The dplyr package has been pre-loaded for you.
- Inspect director_df in the console to see what string separates directors in the director column.
- Spread the values in the director column over separate rows.
director_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/netflix_directors.rds")

director_df %>%
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ")
## # A tibble: 4,815 × 2
## director title
## <chr> <chr>
## 1 Richard Finn Norm of the North: King Sized Adventure
## 2 Tim Maltby Norm of the North: King Sized Adventure
## 3 <NA> Jandino: Whatever it Takes
## 4 Fernando Lebrija #realityhigh
## 5 Gabe Ibáñez Automata
## 6 Rodrigo Toro Fabrizio Copano: Solo pienso en mi
## 7 Francisco Schultz Fabrizio Copano: Solo pienso en mi
## 8 Henrik Ruben Genz Good People
## 9 José Miguel Contreras Joaquín Reyes: Una y no más
## 10 Daniel Alfredson Kidnapping Mr. Heineken
## # … with 4,805 more rows
- Count the number of times each director appears in the data. Make sure to sort the output.
director_df %>%
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ") %>%
  # Count the number of movies per director
  count(director, sort = TRUE)
## # A tibble: 3,536 × 2
## director n
## <chr> <int>
## 1 <NA> 128
## 2 Jan Suter 21
## 3 Raúl Campos 19
## 4 Jay Karas 14
## 5 Marcus Raboy 14
## 6 Jay Chapman 12
## 7 Martin Scorsese 9
## 8 Steven Spielberg 9
## 9 David Dhawan 8
## 10 Johnnie To 8
## # … with 3,526 more rows
- Drop rows containing NA values in the director column.
director_df %>%
  # Drop rows with NA values in the director column
  drop_na(director) %>%
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ") %>%
  # Count the number of movies per director
  count(director, sort = TRUE)
## # A tibble: 3,535 × 2
## director n
## <chr> <int>
## 1 Jan Suter 21
## 2 Raúl Campos 19
## 3 Jay Karas 14
## 4 Marcus Raboy 14
## 5 Jay Chapman 12
## 6 Martin Scorsese 9
## 7 Steven Spielberg 9
## 8 David Dhawan 8
## 9 Johnnie To 8
## 10 Lance Bangs 8
## # … with 3,525 more rows
Good job! Missing values will often show up when you sort values by how often they appear in the data. Expect them to be there and remove them early on!
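The difference between dropping and imputing is easy to see on a small hypothetical tibble (made-up names, not the course data):

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(director = c("A. Director", NA, "B. Director"),
                  title    = c("Movie 1", "Movie 2", "Movie 3"))

# drop_na() removes rows with NA in the given column(s)
demo_df %>% drop_na(director)

# replace_na() keeps all rows and overwrites the NA instead
demo_df %>% replace_na(list(director = "Unknown"))
```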
1.3.2 Imputing sales data
You’ve been asked to create a report that allows management to compare sales figures per quarter for two years.
The problem is that the dataset (sales_df) contains missing values. You’ll need to impute the values in the year column so that you can visualize the data.
The ggplot2 package has been pre-loaded for you.
- Inspect sales_df in the console, paying attention to the year column.
- Use the fill() function to impute the year column in the correct direction.
- Create a line plot where each year has a different color.
sales_df <- tribble(~year, ~quarter, ~sales,
                    NA,     "Q1",    12498,
                    NA,     "Q2",    20461,
                    NA,     "Q3",    19737,
                    "2019", "Q4",    20314,
                    NA,     "Q1",    13494,
                    NA,     "Q2",    19314,
                    NA,     "Q3",    23640,
                    "2020", "Q4",    22920)
sales_df <- sales_df %>% mutate(year = as.factor(year))

sales_df %>%
  # Impute the year column
  fill(year, .direction = "up") %>%
  # Create a line plot with sales per quarter colored by year
  ggplot(aes(x = quarter, y = sales, color = year, group = year)) +
  geom_line()
Alright! Without imputing the year column first, the visual would have been all messed up.
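fill()'s .direction argument controls which way values flow; here is a minimal sketch on a made-up column (the course data needed "up" because each year sits in the Q4 row, below its NAs):

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(year = c(NA, NA, "2019", NA, "2020"))

# "up": each NA takes the next non-missing value below it
demo_df %>% fill(year, .direction = "up")

# "down" (the default): each NA takes the last non-missing value above it
demo_df %>% fill(year, .direction = "down")
```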
1.3.3 Nuclear bombs per continent
Since WWII, a number of nations have been detonating nuclear bombs for military research. A tally of bombs detonated per nation has been calculated from the Nuclear Explosion DataBase (NEDB) and provided as nuke_df. You are interested in finding out how many bombs have been detonated by nations grouped per continent. To achieve this goal, nuke_df will be joined to country_to_continent_df, which is a mapping of nation to continent. You will need to overwrite missing values with zeros so that you can create a nice plot.
The dplyr and ggplot2 packages have been pre-loaded for you.
Side note 1: Bombs detonated by the Soviet Union were attributed to the Russian Federation.
Side note 2: The Russian Federation is solely mapped to Europe for simplicity.
- Inspect nuke_df and country_to_continent_df in the console.
- Replace the missing values in the n_bombs column with 0L. Adding the L sets the data type to integer.
country_to_continent_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRPCvCEPOCd5KrjlVF_iXe2oDxIiXM_2wfoJzp-Za-5PusRoqUy3dkYJ22HCyyrBmT1KMha82C_SPeP/pub?gid=1254696327&single=true&output=csv")
nuke_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=0&single=true&output=csv")

country_to_continent_df %>%
  left_join(nuke_df, by = "country_code") %>%
  # Impute the missing values in the n_bombs column with 0L
  replace_na(list(n_bombs = 0L))
##     continent country_code                                 country_name n_bombs
## 1        Asia          AFG             Afghanistan, Islamic Republic of       0
## 2      Europe          ALB                         Albania, Republic of       0
## 3  Antarctica          ATA Antarctica (the territory South of 60 deg S)       0
## 4      Africa          DZA     Algeria, People's Democratic Republic of       0
## 5     Oceania          ASM                               American Samoa       0
## 6      Europe          AND                     Andorra, Principality of       0
## (output truncated: 258 rows in total; the non-zero n_bombs entries further down are CHN 45, FRA 200, IND 6, PRK 6, PAK 2, RUS 726, GBR 21, and USA 1150)
- Group the dataset by continent and aggregate the data by summing the number of bombs.
country_to_continent_df %>%
  left_join(nuke_df, by = "country_code") %>%
  # Impute the missing values in the n_bombs column with 0L
  replace_na(list(n_bombs = 0L)) %>%
  # Group the dataset by continent
  group_by(continent) %>%
  # Sum the number of bombs per continent
  summarize(n_bombs_continent = sum(n_bombs))
## # A tibble: 8 × 2
## continent n_bombs_continent
## <chr> <int>
## 1 Africa 0
## 2 Antarctica 0
## 3 Asia 59
## 4 Europe 947
## 5 North America 1150
## 6 Oceania 0
## 7 Oceania > 0
## 8 South America 0
- Plot the summed number of bombs detonated by nations from each continent.
country_to_continent_df %>%
  left_join(nuke_df, by = "country_code") %>%
  # Impute the missing values in the n_bombs column with 0L
  replace_na(list(n_bombs = 0L)) %>%
  # Group the dataset by continent
  group_by(continent) %>%
  # Sum the number of bombs per continent
  summarize(n_bombs_continent = sum(n_bombs)) %>%
  # Plot the number of bombs per continent
  ggplot(aes(x = continent, y = n_bombs_continent)) +
  geom_col()
Kaboom! You just finished Chapter 1! Because you replaced missing values with zeros for non-nuclear powers, all continents show up in the plot.
2 From wide to long and back
This chapter is all about pivoting data from a wide to long format and back again using the pivot_longer() and pivot_wider() functions. You’ll need these functions when variables are hidden in messy column names or when variables are stored in rows instead of columns. You’ll learn about space dogs, nuclear bombs, and planet temperatures along the way.
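Before diving into the exercises, here is a minimal round-trip sketch on a made-up table (not the course data) showing that pivot_wider() undoes pivot_longer():

```r
library(tidyr)
library(tibble)
library(dplyr)

wide_df <- tibble(year = c(1945, 1946),
                  `United States` = c(3L, 2L),
                  France = c(0L, 0L))

# Wide to long: column names become values of a new "country" variable
long_df <- wide_df %>%
  pivot_longer(-year, names_to = "country", values_to = "n_bombs")

# Long back to wide: reverses the operation
long_df %>%
  pivot_wider(names_from = country, values_from = n_bombs)
```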
2.1 From wide to long data
2.1.1 Nuclear bombs per country
You’ve been given a version of the Nuclear Explosion DataBase (NEDB) where country names are specified in the column headers (nuke_df). You want to visualize how many nukes were detonated per year per country. You’ll need to pivot the data and replace NA values first.
The ggplot2 package has been pre-loaded for you.
“A-Day” First atomic bomb explosion at Bikini in the Marshall Islands 1 July 1946.
- Pivot all columns except for year to a longer format.
nuke_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/nuke_wide_country.rds")

nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(-year)
## # A tibble: 584 × 3
## year name value
## <dbl> <chr> <int>
## 1 1945 United States 3
## 2 1945 Russian Federation NA
## 3 1945 United Kingdom NA
## 4 1945 France NA
## 5 1945 China NA
## 6 1945 India NA
## 7 1945 Pakistan NA
## 8 1945 North Korea NA
## 9 1946 United States 2
## 10 1946 Russian Federation NA
## # … with 574 more rows
-
The country names are now in the
name
column. Overwrite its name withcountry
. Thevalue
column should be namedn_bombs
.
nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(
    -year,
    # Overwrite the names of the two new columns
    names_to = "country",
    values_to = "n_bombs"
  )
## # A tibble: 584 × 3
## year country n_bombs
## <dbl> <chr> <int>
## 1 1945 United States 3
## 2 1945 Russian Federation NA
## 3 1945 United Kingdom NA
## 4 1945 France NA
## 5 1945 China NA
## 6 1945 India NA
## 7 1945 Pakistan NA
## 8 1945 North Korea NA
## 9 1946 United States 2
## 10 1946 Russian Federation NA
## # … with 574 more rows
-
Replace the
NA
values in then_bombs
column with integer zero values (0L
).
nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(
    -year,
    # Overwrite the names of the two new columns
    names_to = "country",
    values_to = "n_bombs"
  ) %>%
  # Replace NA values for n_bombs with 0L
  replace_na(list(n_bombs = 0L))
## # A tibble: 584 × 3
## year country n_bombs
## <dbl> <chr> <int>
## 1 1945 United States 3
## 2 1945 Russian Federation 0
## 3 1945 United Kingdom 0
## 4 1945 France 0
## 5 1945 China 0
## 6 1945 India 0
## 7 1945 Pakistan 0
## 8 1945 North Korea 0
## 9 1946 United States 2
## 10 1946 Russian Federation 0
## # … with 574 more rows
- Create a line plot where the number of bombs dropped per country is plotted over time. Use country to color the lines.
nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(
    -year,
    # Overwrite the names of the two new columns
    names_to = "country",
    values_to = "n_bombs"
  ) %>%
  # Replace NA values for n_bombs with 0L
  replace_na(list(n_bombs = 0L)) %>%
  # Plot the number of bombs per country over time
  ggplot(aes(x = year, y = n_bombs, color = country)) +
  geom_line()
Good job! You can see the Cold War nuclear arms race heat up in the
sixties and then gradually wind down. Knowing when and how to combine
different tidyr
functions like pivot_longer()
and replace_na()
is an important skill best learned through practice.
2.1.2 WHO obesity per country
According to the World Health Organization (WHO), worldwide obesity has
nearly tripled since 1975. You’re interested in the severity of this
global health issue per country and whether males and females are
affected differently. You’ll use the WHO’s obesity data (obesity_df
)
to investigate this issue. The data holds the percentage of females,
males, and both sexes combined that are considered obese (BMI > 30)
per country.
You want to create a scatterplot where, per nation, you can see the obesity data colored differently for females and males. This implies that sex should become a variable with its own column.
The ggplot2
package has been pre-loaded for you.
-
Inspect
obesity_df
in the console. -
Pivot the
male
andfemale
columns. The old column names should go in thesex
column, the original values should go in thepct_obese
column.
obesity_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=278516960&single=true&output=csv")
obesity_df %>%
  # Pivot the male and female columns
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "pct_obese")
## # A tibble: 380 × 4
## country both_sexes sex pct_obese
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 5.5 male 3.2
## 2 Afghanistan 5.5 female 7.6
## 3 Albania 21.7 male 21.6
## 4 Albania 21.7 female 21.8
## 5 Algeria 27.4 male 19.9
## 6 Algeria 27.4 female 34.9
## 7 Andorra 25.6 male 25.9
## 8 Andorra 25.6 female 25.3
## 9 Angola 8.2 male 4
## 10 Angola 8.2 female 12.1
## # … with 370 more rows
-
Create a scatterplot with
pct_obese
percountry
colored bysex
. Thecountry
variable has been ordered by overall obesity % and added for you.
obesity_df %>%
  # Pivot the male and female columns
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "pct_obese") %>%
  # Create a scatter plot with pct_obese per country colored by sex
  ggplot(aes(x = pct_obese, color = sex,
             y = forcats::fct_reorder(country, both_sexes))) +
  geom_point() +
  scale_y_discrete(breaks = c("India", "Nauru", "Cuba", "Brazil",
                              "Pakistan", "Gabon", "Italy", "Oman",
                              "China", "United States of America")) +
  labs(x = "% Obese", y = "Country")
## Warning: Removed 4 rows containing missing values (geom_point).
Well done! There is a huge spread in obesity prevalence over countries.
Overall, females are affected more often. Because you put the sex
variable in its own column it became easy to add it to a ggplot
.
2.1.3 Bond… James Bond
You’ve been given a James Bond movie dataset (bond_df
) and
want to visualize the number of movies that Bond actors have featured in
per decade. However, the data is in an untidy format with the decade
values captured in the column headers. You’ll tidy this dataset to give
each variable its own column.
The ggplot2
package has been pre-loaded for you.
-
Pivot all columns except
Bond
to a longer format and set the names of the newly created columns todecade
andn_movies
.
bond_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=595802601&single=true&output=csv")
colnames(bond_df) = c("Bond", 1960, 1970, 1980, 1990, 2000, 2010, 2020)
bond_df %>%
  # Pivot the data to long format and set the column names
  pivot_longer(
    -Bond,
    names_to = "decade",
    values_to = "n_movies"
  )
## # A tibble: 49 × 3
## Bond decade n_movies
## <chr> <chr> <int>
## 1 Sean Connery 1960 5
## 2 Sean Connery 1970 1
## 3 Sean Connery 1980 NA
## 4 Sean Connery 1990 NA
## 5 Sean Connery 2000 NA
## 6 Sean Connery 2010 NA
## 7 Sean Connery 2020 NA
## 8 David Niven 1960 1
## 9 David Niven 1970 NA
## 10 David Niven 1980 NA
## # … with 39 more rows
-
Drop any
NA
values in then_movies
column while it is created.
bond_df %>%
  # Pivot the data to long format
  pivot_longer(
    -Bond,
    # Overwrite the names of the two newly created columns
    names_to = "decade",
    values_to = "n_movies",
    # Drop NA values
    values_drop_na = TRUE
  )
## # A tibble: 12 × 3
## Bond decade n_movies
## <chr> <chr> <int>
## 1 Sean Connery 1960 5
## 2 Sean Connery 1970 1
## 3 David Niven 1960 1
## 4 George Lazenby 1960 1
## 5 Roger Moore 1970 4
## 6 Roger Moore 1980 3
## 7 Timothy Dalton 1980 2
## 8 Pierce Brosnan 1990 3
## 9 Pierce Brosnan 2000 1
## 10 Daniel Craig 2000 2
## 11 Daniel Craig 2010 2
## 12 Daniel Craig 2020 1
-
Transform the
decade
column data type to integer.
bond_df %>%
  # Pivot the data to long format
  pivot_longer(
    -Bond,
    # Overwrite the names of the two newly created columns
    names_to = "decade",
    values_to = "n_movies",
    # Drop NA values
    values_drop_na = TRUE,
    # Transform the decade column data type to integer
    names_transform = list(decade = as.integer)
  ) %>%
  ggplot(aes(x = decade + 5, y = n_movies, fill = Bond)) +
  geom_col()
Well done! If you’ve worked with tidyr
in the past, you may remember the gather()
function. This is a now retired predecessor of the pivot_longer()
function.
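For comparison, here is the same pivot written both ways on a hypothetical one-row slice of the Bond data; the gather() call still runs for backward compatibility, but new code should use pivot_longer():

```r
library(tidyr)

# One-row toy version of bond_df (hypothetical values)
df <- tibble::tibble(Bond = "Sean Connery", `1960` = 5L, `1970` = 1L)

# Retired syntax, kept for backward compatibility
gather(df, key = "decade", value = "n_movies", -Bond)

# Modern equivalent
pivot_longer(df, -Bond, names_to = "decade", values_to = "n_movies")
```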
2.2 Deriving variables from column headers
2.2.1 New-Zealand’s bird of the year
Every year New Zealanders vote en masse to decide which species gets the
bird of the year trophy. The contest is organized by the Forest & Bird agency
which allows each person to give points to up to five birds (first pick
gets 5 points, second 4, …). Your job is to decide this year’s winner
from the messy dataset that’s been pre-loaded for you as bird_df
.
The dplyr
package has been pre-loaded for you.
-
Inspect
bird_df
in the console. -
Pivot
bird_df
to longer format so that an integer columnpoints
and a character columnspecies
are created. Use thenames_prefix
argument to clean up thepoints
column and make sure noNA
values remain.
bird_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQi_yI_hSJxlAZpWoqIrp8F9b-g6PLT0snajIporAW2hw8beu90XVaoL_SgfmZ0NAmSRqJyHcCNghPO/pub?gid=1305911804&single=true&output=csv")
bird_df %>%
  # Pivot the data to create a two column data frame
  pivot_longer(
    starts_with("points_"),
    names_to = "points",
    names_prefix = "points_",
    names_transform = list(points = as.integer),
    values_to = "species",
    values_drop_na = TRUE
  )
## # A tibble: 40,000 × 3
## point_3 points species
## <chr> <int> <chr>
## 1 Kākā 5 "Gibson's Albatross"
## 2 Kākā 4 "Tūī"
## 3 Kākā 2 "Kākāpō"
## 4 Kākā 1 "Little Spotted Kiwi"
## 5 Weka 5 "Spotted Shag"
## 6 Weka 4 "Fantail"
## 7 Weka 2 ""
## 8 Weka 1 ""
## 9 Little Spotted Kiwi 5 "Great Spotted Kiwi"
## 10 Little Spotted Kiwi 4 "Kea"
## # … with 39,990 more rows
-
Calculate the
total_points
each species got.
bird_df %>%
  # Pivot the data to create a two column data frame
  pivot_longer(
    starts_with("points_"),
    names_to = "points",
    names_prefix = "points_",
    names_transform = list(points = as.integer),
    values_to = "species",
    values_drop_na = TRUE
  ) %>%
  group_by(species) %>%
  summarize(total_points = sum(points)) %>%
  slice_max(total_points, n = 5)
## # A tibble: 5 × 2
## species total_points
## <chr> <int>
## 1 Kākāpō 7675
## 2 Yellow-eyed penguin 6554
## 3 Black Robin 4902
## 4 Kererū 4264
## 5 Banded Dotterel 4162
Good job! A well-deserved win for the Kākāpō!
2.2.2 Big tech stock prices
You’re an analyst at an investment firm and want to visualize the weekly closing prices of five big tech firms’ stocks.
However, the dataset you’ve been handed (stock_df
) is messy and has the year and week variables stored in the column headers.
You’ll pivot this data into a tidy format, extract the variables from the headers, and create a line plot.
The ggplot2
package has been pre-loaded for you.
-
Inspect
stock_df
in the console. -
Pivot
stock_df
so that the integer columnsyear
andweek
are created from the column names and the original values are moved to theprice
column. Use thenames_sep
argument to separate the column names.
stock_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=1049805716&single=true&output=csv")
colnames(stock_df) = c("company", paste0(2019, "_week", 1:52), paste0(2020, "_week", 1:53))
stock_df %>%
  # Pivot the data to create 3 new columns: year, week, price
  pivot_longer(
    -company,
    names_to = c("year", "week"),
    values_to = "price",
    names_sep = "_week",
    names_transform = list(
      year = as.integer,
      week = as.integer)
  )
## # A tibble: 525 × 4
## company year week price
## <chr> <int> <int> <dbl>
## 1 Amazon 2019 1 1848.
## 2 Amazon 2019 2 1641.
## 3 Amazon 2019 3 1696.
## 4 Amazon 2019 4 1671.
## 5 Amazon 2019 5 1626.
## 6 Amazon 2019 6 1588.
## 7 Amazon 2019 7 1608.
## 8 Amazon 2019 8 1632.
## 9 Amazon 2019 9 1672.
## 10 Amazon 2019 10 1621.
## # … with 515 more rows
-
Create a line plot where the
price
is shown perweek
and color bycompany
. Theyear
variable has been dealt with for you.
stock_df %>%
  # Pivot the data to create 3 new columns: year, week, price
  pivot_longer(
    -company,
    names_to = c("year", "week"),
    values_to = "price",
    names_sep = "_week",
    names_transform = list(
      year = as.integer,
      week = as.integer)
  ) %>%
  # Create a line plot with price per week, color by company
  ggplot(aes(x = week, y = price, color = company)) +
  geom_line() +
  facet_grid(. ~ year)
Well done! You can now extract multiple variables from column headers. I hope you are starting to see just how powerful the pivot_longer()
function is!
2.3 Deriving variables from complex column headers
2.3.1 Soviet space dogs, the dog perspective
You’ll be working on a pre-processed sample of the USSR space dogs database compiled by Duncan Geere and pre-loaded for you as space_dogs_df
. Each of the 42 rows in this dataset represents a test rocket launch which had one or two very brave dogs on board.
Your goal is to reshape this dataset so that for each launch, each dog has a row.
The challenge is that in the column headers (name_1
, name_2
, gender_1
, and gender_2
), the part before the _
separator can point to two different variables (name and gender), while the second part always points to the dog ID (1st or 2nd dog).
Laika, the first animal to orbit Earth in 1957
-
As the first argument to
pivot_longer()
, pass the columns to pivot (name_1
,name_2
,gender_1
, andgender_2
). -
Complete the
names_to
argument so that the first part of the column headers are reused. -
Make sure
NA
values are dropped since not all rockets had two dogs.
space_dogs_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrv7ECpC0Jwv5u4Y5KZFZOGJ2SnEruSs7xoGs_ZPLbLS5CI32JVRLAxpv0dyneZovOOQ_jyaqcZeoh/pub?gid=0&single=true&output=csv")
space_dogs_df %>%
  pivot_longer(
    # Add the columns to pivot
    name_1:gender_2,
    names_sep = "_",
    # Complete the names_to argument to re-use the first part of the column headers
    names_to = c(".value", "dog_id"),
    # Make sure NA values are dropped
    values_drop_na = TRUE
  )
## # A tibble: 84 × 5
## date result dog_id name gender
## <chr> <chr> <chr> <chr> <chr>
## 1 1966-02-22 recovered safely after a 22-day flight 1 "Ugo… "Male"
## 2 1966-02-22 recovered safely after a 22-day flight 2 "Vet… "Male"
## 3 1961-03-25 one orbit, recovered safely 1 "Zve… "Fema…
## 4 1961-03-25 one orbit, recovered safely 2 "" ""
## 5 1961-03-09 one orbit, recovered safely 1 "Che… "Fema…
## 6 1961-03-09 one orbit, recovered safely 2 "" ""
## 7 1960-12-22 upper stage failed, dogs recovered after a su… 1 "Shu… "Fema…
## 8 1960-12-22 upper stage failed, dogs recovered after a su… 2 "Kom… "Fema…
## 9 1960-12-01 spent one day in orbit, capsule was destroyed… 1 "Mus… "Fema…
## 10 1960-12-01 spent one day in orbit, capsule was destroyed… 2 "Pch… "Fema…
## # … with 74 more rows
Well done! Notice how the “name”
and “gender”
parts of the old column headers were reused in two new columns. The
second part of the old headers is found as a variable in the dog_id
column. Since most launches had two dogs, the number of rows almost doubled.
2.3.2 WHO obesity vs. life expectancy
You’ve been given a sample of WHO data (who_df
) with
obesity percentages and life expectancy data per country, year, and sex.
You want to visually inspect the correlation between obesity and life
expectancy.
However, the data is very messy with four variables hidden in the column
names. Each column name is made up of three parts separated by
underscores: Values for the year
, followed by those for sex
, and then values for either pct.obese
or life.exp
. Since the third part of the column name string holds two variables you’ll need to use the special “.value”
value in the names_to
argument.
You’ll pivot the data into a tidy format and create the scatterplot.
The ggplot2
package has been pre-loaded for you.
-
Inspect
who_df
in the console. -
Pivot the data so that each variable (
year
,sex
,pct.obese
,life.exp
) has a column of the correct data type.
#who_df <- read.csv("https://raw.githubusercontent.com/Vishal0229/Data605/master/Week12/who.csv", header =T)
# who_df %>%
# # Put each variable in its own column
# pivot_longer(
# -country,
# names_to = c("year", "sex", ".value"),
# names_sep = "_",
# names_transform = list("year" = as.integer)
# )
-
Use
ggplot()
to create a scatterplot withlife.exp
overpct.obese
. Color the points bysex
.
# who_df %>%
# # Put each variable in its own column
# pivot_longer(
# -country,
# names_to = c("year", "sex", ".value"),
# names_sep = "_",
# names_transform = list("year" = as.integer)
# ) %>%
# # Create a plot with life expectancy over obesity
# ggplot(aes(x = pct.obese, y = life.exp, color = sex))+
# geom_point()
Well done, you’ve transformed an extremely messy dataset with a single call of the pivot_longer()
function! Perhaps surprisingly, countries with low life expectancy
often have low obesity values even though we expected to see a negative
trend. This is probably due to non-obesity related health issues being
more important in these countries.
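Since the who_df file isn't available here (which is why the code above is commented out), the toy tibble below mimics its column layout and demonstrates the ".value" trick in a self-contained way; all country names and numbers are made up:

```r
library(tidyr)

# Column names hold year, sex, and the variable name, separated by "_"
toy_df <- tibble::tibble(
  country = c("X", "Y"),
  `1975_male_pct.obese`   = c(5, 6),
  `1975_male_life.exp`    = c(60, 65),
  `1975_female_pct.obese` = c(7, 8),
  `1975_female_life.exp`  = c(64, 70)
)

toy_df %>%
  pivot_longer(
    -country,
    # ".value" keeps the third name part (pct.obese / life.exp) as column names
    names_to = c("year", "sex", ".value"),
    names_sep = "_",
    names_transform = list(year = as.integer)
  )
```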
2.3.3 Uncounting observations
You’ve found the job of your dreams providing technical support for a
dog breed beauty contest. The jury members want a spreadsheet with the
breed and id of each participating dog so that they can add the scores
later on. You’ve only been given the number of participants per dog
breed (dog_df
) so you decide to use your tidyr
skills to create the desired result.
- Inspect the data in the console.
-
Uncount the data so that per breed, each dog gets a row and an ID. The ID should go in the
dog_id
column.
dog_df = tribble(~breed, ~n_participants,
                 "Husky", 5,
                 "Golden retriever", 12,
                 "Poodle", 13,
                 "Shiba Inu", 7)
dog_df %>%
  # Create one row for each participant and add the id
  uncount(n_participants, .id = "dog_id")
## # A tibble: 37 × 2
## breed dog_id
## <chr> <int>
## 1 Husky 1
## 2 Husky 2
## 3 Husky 3
## 4 Husky 4
## 5 Husky 5
## 6 Golden retriever 1
## 7 Golden retriever 2
## 8 Golden retriever 3
## 9 Golden retriever 4
## 10 Golden retriever 5
## # … with 27 more rows
May the prettiest dog win! Uncounting data isn’t something you’ll do every day, but in specific cases it’s just what you need.
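A handy sanity check: uncount() is the inverse of dplyr's count(), so counting the expanded rows recovers the original tallies. A sketch on a two-breed subset:

```r
library(tidyr)
library(dplyr)

dog_df <- tribble(~breed, ~n_participants,
                  "Husky", 5,
                  "Poodle", 13)

# Expand to one row per dog, then count back per breed
dog_df %>%
  uncount(n_participants) %>%
  count(breed, name = "n_participants")
```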
2.4 From long to wide data
2.4.1 Soviet space dogs, the flight perspective
Remember the USSR space dogs dataset¹? You changed it to a long format so that for every dog in every rocket launch, there was a row. Suppose you’re given this tidy dataset and are asked to answer the question, “In what percentage of flights were both dogs of the same gender?”
You’ll reshape and investigate space_dogs_df
to find the answer.
The dplyr
package has been pre-loaded for you.
1 Compiled by Duncan Geere.
-
Pivot the data to a wider format, deriving new column names from the
dog_id
column and values from thegender
column.
space_dogs_df1 = space_dogs_df %>%
  pivot_longer(
    # Add the columns to pivot
    name_1:gender_2,
    names_sep = "_",
    # Complete the names_to argument to re-use the first part of the column headers
    names_to = c(".value", "dog_id"),
    # Make sure NA values are dropped
    values_drop_na = TRUE
  )
space_dogs_df1 %>%
  # Pivot the data to a wider format
  pivot_wider(
    names_from = dog_id,
    values_from = gender,
    names_prefix = "gender_"
  )
## # A tibble: 84 × 5
## date result name gender_1 gender_2
## <chr> <chr> <chr> <chr> <chr>
## 1 1966-02-22 recovered safely after a 22-day flight "Ugo… Male <NA>
## 2 1966-02-22 recovered safely after a 22-day flight "Vet… <NA> "Male"
## 3 1961-03-25 one orbit, recovered safely "Zve… Female <NA>
## 4 1961-03-25 one orbit, recovered safely "" <NA> ""
## 5 1961-03-09 one orbit, recovered safely "Che… Female <NA>
## 6 1961-03-09 one orbit, recovered safely "" <NA> ""
## 7 1960-12-22 upper stage failed, dogs recovered after … "Shu… Female <NA>
## 8 1960-12-22 upper stage failed, dogs recovered after … "Kom… <NA> "Female"
## 9 1960-12-01 spent one day in orbit, capsule was destr… "Mus… Female <NA>
## 10 1960-12-01 spent one day in orbit, capsule was destr… "Pch… <NA> "Female"
## # … with 74 more rows
-
Drop rows that contain
NA
values.
space_dogs_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrv7ECpC0Jwv5u4Y5KZFZOGJ2SnEruSs7xoGs_ZPLbLS5CI32JVRLAxpv0dyneZovOOQ_jyaqcZeoh/pub?gid=1588863802&single=true&output=csv")
space_dogs_df %>%
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>%
  # Drop rows with NA values
  drop_na()
## # A tibble: 39 × 3
## date gender_1 gender_2
## <int> <chr> <chr>
## 1 18831 Male Male
## 2 18838 Male Female
## 3 18855 Male Male
## 4 18859 Male Male
## 5 18868 Male Male
## 6 18874 Male Male
## 7 18805 Female Male
## 8 19907 Female Male
## 9 19912 Female Male
## 10 19931 Female Male
## # … with 29 more rows
-
Create a new column
same_gender
, which has aTRUE
value whengender_1
equalsgender_2
.
space_dogs_df %>%
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>%
  # Drop rows with NA values
  drop_na() %>%
  # Create a Boolean column on whether both dogs have the same gender
  mutate(same_gender = gender_1 == gender_2) %>%
  summarize(pct_same_gender = mean(same_gender))
## # A tibble: 1 × 1
## pct_same_gender
## <dbl>
## 1 0.795
Good job! The answer to the question is 80%. By switching back to wide
format, it was easy to remove flights with just one dog and calculate
per-flight properties. If you’ve been using tidyr since before version 1.0, you may remember that pivot_wider() was then named spread().
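For reference, here are the retired and modern spellings side by side on a toy long-format tibble (hypothetical values):

```r
library(tidyr)

df <- tibble::tibble(
  planet = c("Earth", "Earth"),
  metric = c("diameter", "temperature"),
  value  = c(12756, 15)
)

# Retired syntax, kept for backward compatibility
spread(df, key = metric, value = value)

# Modern equivalent
pivot_wider(df, names_from = metric, values_from = value)
```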
2.4.2 Planet temperature & distance to the Sun
The intensity of light radiated by a light source follows an inverse square relationship with the distance it has traveled.
You wonder if you could observe this trend in the temperature of the
planets in our Solar System given their distance to the Sun. You’ll use
the
planet_df
dataset from the devstronomy project to investigate this.
-
Inspect
planet_df
in the console. -
Use the
pivot_wider()
function to extract column names from themetric
column and values from thevalue
column.
planet_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/planet_metrics_long.rds")
planet_df %>%
  # Give each planet variable its own column
  pivot_wider(names_from = "metric", values_from = "value")
## # A tibble: 8 × 4
## planet diameter distance_to_sun temperature
## <chr> <dbl> <dbl> <dbl>
## 1 Mercury 4879 57.9 167
## 2 Venus 12104 108. 464
## 3 Earth 12756 150. 15
## 4 Mars 6792 228. -65
## 5 Jupiter 142984 779. -110
## 6 Saturn 120536 1434. -140
## 7 Uranus 51118 2872. -195
## 8 Neptune 49528 4495. -200
-
Use the
ggplot()
function to create a plot with thetemperature
over thedistance_to_sun
.
planet_df %>%
  # Give each planet variable its own column
  pivot_wider(names_from = "metric", values_from = "value") %>%
  # Plot planet temperature over distance to sun
  ggplot(aes(x = distance_to_sun, y = temperature)) +
  geom_point(aes(size = diameter)) +
  geom_text(aes(label = planet), vjust = -1) +
  labs(x = "Distance to sun (million km)",
       y = "Mean temperature (°C)") +
  theme(legend.position = "none")
Stellar! The inverse square relation is indeed present. Only Venus doesn’t follow the trend: it is farther from the Sun than Mercury but also hotter. It turns out it has a lot of CO2 in its atmosphere trapping the heat. Let’s not become like Venus ;)
2.4.3 Transposing planet data
You’re again working on a planet dataset derived from the devstronomy project. This time, you’re interested in the correlation between the diameter of a planet and the number of moons circling it.
However, the dataset (planet_df
) has a row for each
variable and a column for each planet (observation). You’ll transpose
this data in two steps and then create a plot to inspect the
correlation.
The ggplot2
package has been pre-loaded for you.
-
Inspect
planet_df
in the console. -
Pivot the data so that planet names are put in a column named
planet
.
planet_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/planet_metrics_wide.rds")

planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet")
## # A tibble: 40 × 3
## metric planet value
## <chr> <chr> <dbl>
## 1 mass Mercury 0.33
## 2 mass Venus 4.87
## 3 mass Earth 5.97
## 4 mass Mars 0.642
## 5 mass Jupiter 1898
## 6 mass Saturn 568
## 7 mass Uranus 86.8
## 8 mass Neptune 102
## 9 diameter Mercury 4879
## 10 diameter Venus 12104
## # … with 30 more rows
-
Pivot the data so that each variable in the
metric
column gets its own column.
planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet") %>%
  # Put each metric in its own column
  pivot_wider(names_from = metric, values_from = value)
## # A tibble: 8 × 6
## planet mass diameter distance_from_sun gravity number_of_moons
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mercury 0.33 4879 57.9 3.7 0
## 2 Venus 4.87 12104 108. 8.9 0
## 3 Earth 5.97 12756 150. 9.8 1
## 4 Mars 0.642 6792 228. 3.7 2
## 5 Jupiter 1898 142984 779. 23.1 79
## 6 Saturn 568 120536 1434. 9 62
## 7 Uranus 86.8 51118 2872. 8.7 27
## 8 Neptune 102 49528 4495. 11 14
-
Use the
ggplot()
function to create a plot with thenumber_of_moons
overdiameter
.
planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet") %>%
  # Put each metric in its own column
  pivot_wider(names_from = metric, values_from = value) %>%
  # Plot the number of moons vs planet diameter
  ggplot(aes(x = diameter, y = number_of_moons)) +
  geom_point(aes(size = diameter)) +
  geom_text(aes(label = planet), vjust = -1) +
  labs(x = "Diameter (km)", y = "Number of moons") +
  theme(legend.position = "none")
Well done! It’s clear that bigger planets tend to have more moons. Once you’re able to split the messiness of a dataset into smaller, easier to fix problems, you’ve become a data wrangling master.
3 Expanding data
Values can often be missing in your data, and sometimes entire observations are absent too. In this chapter, you’ll learn how to complete your dataset with these missing observations. You’ll add observations with zero values to counted data, expand time series to a full sequence of intervals, and more!
3.1 Creating unique combinations of vectors
3.1.1 Letters of the genetic code
The basic building blocks of RNA are four molecules described by a single letter each: adenine (A), cytosine (C), guanine (G), and uracil (U). The information carried by an RNA strand can be represented as a long sequence of these four letters. To read this code, one has to divide this chain into sequences of three letters each (e.g. GCU, ACG, …). These three letter sequences are known as codons. The concept is illustrated in the image below.
Your goal for this exercise is to create a data frame with all possible three letter sequences (codons) from a vector with the four letters representing the RNA building blocks.
-
Create a tibble with three columns called
letter1
,letter2
, andletter3
that holds all possible combinations of the vectorletters
usingexpand_grid()
.
letters <- c("A", "C", "G", "U")

# Create a tibble with all possible 3 way combinations
codon_df <- expand_grid(
  letter1 = letters,
  letter2 = letters,
  letter3 = letters
)
codon_df
## # A tibble: 64 × 3
## letter1 letter2 letter3
## <chr> <chr> <chr>
## 1 A A A
## 2 A A C
## 3 A A G
## 4 A A U
## 5 A C A
## 6 A C C
## 7 A C G
## 8 A C U
## 9 A G A
## 10 A G C
## # … with 54 more rows
-
Use the
unite()
function from chapter one to merge these three columns into a single column namedcodon
. Use an empty string as the separator.
letters <- c("A", "C", "G", "U")

# Create a tibble with all possible 3 way combinations
codon_df <- expand_grid(
  letter1 = letters,
  letter2 = letters,
  letter3 = letters
)

codon_df %>%
  # Unite these three columns into a "codon" column
  unite("codon", letter1:letter3, sep = "")
## # A tibble: 64 × 1
## codon
## <chr>
## 1 AAA
## 2 AAC
## 3 AAG
## 4 AAU
## 5 ACA
## 6 ACC
## 7 ACG
## 8 ACU
## 9 AGA
## 10 AGC
## # … with 54 more rows
Well done! There are 64 possible codons (64 = 4³). They encode 20 amino acids, the building blocks of proteins and, thus, you!
3.1.2 When did humans replace dogs in space?
You already know that in the early days of spaceflight, the USSR was
testing rockets with dogs. You now wonder when exactly humans started
replacing dogs on space flight missions. You’ve been given a dataset space_df
with the number of both dogs (compiled by Duncan Geere) and humans in space per year from 1951 till 1970 (collected from Wikipedia).
Your goal is to create a plot that shows you the number of individuals
sent into space per species. Before you can create this plot, you’ll
first have to introduce zero values for missing combinations of year
and species
.
The dplyr
and ggplot2
packages have been pre-loaded for you.
-
Create
full_df
, a tibble with all unique combinations of the variablesyear
(from 1951 to 1970) andspecies
(“Human”
and“Dog”
).
# Create a tibble with all combinations of years and species
full_df <- expand_grid(
  year = 1951:1970,
  species = c("Human", "Dog")
)
full_df
## # A tibble: 40 × 2
## year species
## <int> <chr>
## 1 1951 Human
## 2 1951 Dog
## 3 1952 Human
## 4 1952 Dog
## 5 1953 Human
## 6 1953 Dog
## 7 1954 Human
## 8 1954 Dog
## 9 1955 Human
## 10 1955 Dog
## # … with 30 more rows
-
Perform a
right_join()
betweenspace_df
andfull_df
on theyear
andspecies
columns.
space_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrv7ECpC0Jwv5u4Y5KZFZOGJ2SnEruSs7xoGs_ZPLbLS5CI32JVRLAxpv0dyneZovOOQ_jyaqcZeoh/pub?gid=1478550685&single=true&output=csv")
space_df %>%
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>%
  arrange(year)
## year species n_in_space
## 1 1951 Dog 14
## 2 1951 Human NA
## 3 1952 Human NA
## 4 1952 Dog NA
## 5 1953 Human NA
## 6 1953 Dog NA
## 7 1954 Dog 6
## 8 1954 Human NA
## 9 1955 Dog 6
## 10 1955 Human NA
## 11 1956 Dog 8
## 12 1956 Human NA
## 13 1957 Dog 11
## 14 1957 Human NA
## 15 1958 Dog 12
## 16 1958 Human NA
## 17 1959 Dog 4
## 18 1959 Human NA
## 19 1960 Dog 16
## 20 1960 Human NA
## 21 1961 Dog 2
## 22 1961 Human 4
## 23 1962 Human 5
## 24 1962 Dog NA
## 25 1963 Human 5
## 26 1963 Dog NA
## 27 1964 Human 3
## 28 1964 Dog NA
## 29 1965 Human 12
## 30 1965 Dog NA
## 31 1966 Dog 2
## 32 1966 Human 10
## 33 1967 Human 1
## 34 1967 Dog NA
## 35 1968 Human 7
## 36 1968 Dog NA
## 37 1969 Human 23
## 38 1969 Dog NA
## 39 1970 Human 5
## 40 1970 Dog NA
-
Use the
ggplot()
function to create a line plot ofn_in_space
overyear
, colored byspecies
.
space_df %>%
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>%
  # Create a line plot with n_in_space over year, color by species
  ggplot(aes(x = year, y = n_in_space, color = species)) +
  geom_line()
## Warning: Removed 14 row(s) containing missing values (geom_path).
-
Use the
replace_na()
function to overwriteNA
values in then_in_space
column with zeros.
space_df %>%
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>%
  # Overwrite NA values for n_in_space with 0L
  replace_na(list(n_in_space = 0L)) %>%
  # Create a line plot with n_in_space over year, color by species
  ggplot(aes(x = year, y = n_in_space, color = species)) +
  geom_line()
Good job! It looks like 1961 was the year humans replaced dogs in space. Adding zero observations allowed you to create a correct and complete plot.
3.1.3 Finding missing observations
You’re an inspector at a nuclear plant and have to validate whether
every reactor has received its daily safety check over the course of a
full year. The safety check logs are in reactor_df
, a data frame with columns date
, reactor
, and check
.
Two vectors, dates
and reactors
, with all dates of the year and reactors at the plant respectively have been created for you.
You’ll use the combination of the expand_grid()
and anti_join()
functions to find dates where particular reactors were not checked.
The dplyr
package has been pre-loaded for you.
- Use the expand_grid() function to create a tibble holding all combinations of the variables date and reactor. Use the dates and reactors vectors created for you.
dates = seq(as.Date("1986-01-01"), as.Date("1986-12-31"), "days")
reactors = c("A", "B", "C", "D")
# Create a tibble with all combinations of dates and reactors
full_df <- expand_grid(date = dates, reactor = reactors)
- Perform an anti_join() between full_df and reactor_df on the date and reactor columns.
# Find the reactor - date combinations not present in reactor_df
#full_df %>%
# anti_join(reactor_df, by = c("date", "reactor"))
Well done, looks like safety was not optimal in April 1986. Sometimes the observations not in the data are more important than the ones present. The trick you just learned can help you find them.
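The pattern generalizes well beyond this exercise; here is a tiny self-contained sketch with made-up log data standing in for reactor_df:

```r
library(tidyr)
library(dplyr)

# Hypothetical check log with one inspection missing
log_df <- tibble(
  date    = as.Date(c("1986-04-25", "1986-04-25", "1986-04-26")),
  reactor = c("A", "B", "A")
)

# All date - reactor combinations that should exist
full_df <- expand_grid(
  date    = as.Date(c("1986-04-25", "1986-04-26")),
  reactor = c("A", "B")
)

# Combinations in full_df with no match in log_df: the missed checks
full_df %>%
  anti_join(log_df, by = c("date", "reactor"))
# -> one row: 1986-04-26, reactor B
```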
3.2 Completing data with all value combinations
3.2.1 Completing the Solar System
You have been given a data frame (planet_df
) from the devstronomy project
with the number of moons per planet in our Solar System. However,
Mercury and Venus, the two moonless planets, are absent. You want to
expand this dataset using the complete()
function and a vector planets
that contains all eight planets’ names.
- Complete the planet variable using the planets vector.
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
- Fill NA values in the n_moons variable with 0L values.
planet_df <- tribble(~planet, ~n_moons,
                     "Earth", 1,
                     "Mars", 2,
                     "Jupiter", 79,
                     "Saturn", 62,
                     "Uranus", 27,
                     "Neptune", 14)
planet_df %>%
  complete(
    # Complete the planet variable
    planet = planets,
    # Overwrite NA values for n_moons with 0L
    fill = list(n_moons = 0L)
  )
## # A tibble: 8 × 2
## planet n_moons
## <chr> <dbl>
## 1 Earth 1
## 2 Jupiter 79
## 3 Mars 2
## 4 Mercury 0
## 5 Neptune 14
## 6 Saturn 62
## 7 Uranus 27
## 8 Venus 0
Well done! By completing with a vector of all known planets, you can be sure that each will have an observation in the output. Did you know Jupiter is circled by at least 79 moons? This number is likely to increase as new ones continue to be discovered every few years.
3.2.2 Zero Olympic medals
Since 1896, athletes from all over the world have been competing in the modern Olympic games. You’ve been given a dataset (medal_df
)
with observations for all medals won by athletes from the 10 most
successful countries in Olympic history. You want to create a visual
with the number of medals won per country (team
) per year
.
However, since not all countries won medals each year, you’ll have to
introduce zero values before you can make an accurate visual.
The ggplot2
and dplyr
packages have been pre-loaded for you. In steps 2 and 3, the scale_color_brewer()
function is used to color lines in the plot with a palette that makes it easier to distinguish the different countries.
- Inspect medal_df in the console.
- Count the number of medals won per team and year.
medal_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/olympic_medals_top_10.rds")
medal_df %>%
  # Count the medals won per team and year
  count(team, year, name = "n_medals")
## # A tibble: 231 × 3
## team year n_medals
## <chr> <dbl> <int>
## 1 Australia 1896 2
## 2 Australia 1900 5
## 3 Australia 1904 4
## 4 Australia 1906 3
## 5 Australia 1920 7
## 6 Australia 1924 10
## 7 Australia 1928 4
## 8 Australia 1932 5
## 9 Australia 1936 1
## 10 Australia 1948 16
## # … with 221 more rows
- Use ggplot() to create a line plot with n_medals over year, colored by team.
medal_df %>%
  # Count the medals won per team and year
  count(team, year, name = "n_medals") %>%
  # Plot n_medals over year, colored by team
  ggplot(aes(x = year, y = n_medals, color = team)) +
  geom_line() +
  scale_color_brewer(palette = "Paired")
- Complete the team and year variables, and replace NA values in the n_medals column with zeros.
medal_df %>%
  # Count the medals won per team and year
  count(team, year, name = "n_medals") %>%
  # Complete the team and year variables, fill n_medals with zeros
  complete(team, year, fill = list(n_medals = 0)) %>%
  # Plot n_medals over year, colored by team
  ggplot(aes(x = year, y = n_medals, color = team)) +
  geom_line() +
  scale_color_brewer(palette = "Paired")
Well done! Zero values are often absent from counted data, which can lead to misleading visualizations. We can now see the USA boycott of the 1980 Olympics in Moscow and the subsequent USSR boycott of the 1984 Olympics in Los Angeles.
3.2.3 Creating a sequence with full_seq()
The full_seq()
function looks for the minimum and maximum values in the vector you pass it, then generates the full sequence of numbers with a fixed period between them.
When used inside the complete()
function, full_seq()
is a handy tool to make sure there are no missing observations in your
data. Before combining these two functions you’ll generate a few
sequences with full_seq()
on its own to get the hang of this function.
- Use full_seq() to create a sequence with all years from 2020 to 2030.
# Generate all years from 2020 to 2030
years <- full_seq(c(2020, 2030), period = 1)
years
## [1] 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030
- Use full_seq() to create a sequence with all decades from 1980 to 2030.
# Generate all decades from 1980 to 2030
decades <- full_seq(c(1980, 2030), period = 10)
decades
## [1] 1980 1990 2000 2010 2020 2030
- Use full_seq() to create a sequence with all dates in 1980 using the outer_dates vector.
outer_dates <- c(as.Date("1980-01-01"), as.Date("1980-12-31"))
# Generate the dates for all days in 1980
full_seq(outer_dates, period = 1)
## [1] "1980-01-01" "1980-01-02" "1980-01-03" "1980-01-04" "1980-01-05"
## [6] "1980-01-06" "1980-01-07" "1980-01-08" "1980-01-09" "1980-01-10"
## [11] "1980-01-11" "1980-01-12" "1980-01-13" "1980-01-14" "1980-01-15"
## [16] "1980-01-16" "1980-01-17" "1980-01-18" "1980-01-19" "1980-01-20"
## [21] "1980-01-21" "1980-01-22" "1980-01-23" "1980-01-24" "1980-01-25"
## [26] "1980-01-26" "1980-01-27" "1980-01-28" "1980-01-29" "1980-01-30"
## [31] "1980-01-31" "1980-02-01" "1980-02-02" "1980-02-03" "1980-02-04"
## [36] "1980-02-05" "1980-02-06" "1980-02-07" "1980-02-08" "1980-02-09"
## [41] "1980-02-10" "1980-02-11" "1980-02-12" "1980-02-13" "1980-02-14"
## [46] "1980-02-15" "1980-02-16" "1980-02-17" "1980-02-18" "1980-02-19"
## [51] "1980-02-20" "1980-02-21" "1980-02-22" "1980-02-23" "1980-02-24"
## [56] "1980-02-25" "1980-02-26" "1980-02-27" "1980-02-28" "1980-02-29"
## [61] "1980-03-01" "1980-03-02" "1980-03-03" "1980-03-04" "1980-03-05"
## [66] "1980-03-06" "1980-03-07" "1980-03-08" "1980-03-09" "1980-03-10"
## [71] "1980-03-11" "1980-03-12" "1980-03-13" "1980-03-14" "1980-03-15"
## [76] "1980-03-16" "1980-03-17" "1980-03-18" "1980-03-19" "1980-03-20"
## [81] "1980-03-21" "1980-03-22" "1980-03-23" "1980-03-24" "1980-03-25"
## [86] "1980-03-26" "1980-03-27" "1980-03-28" "1980-03-29" "1980-03-30"
## [91] "1980-03-31" "1980-04-01" "1980-04-02" "1980-04-03" "1980-04-04"
## [96] "1980-04-05" "1980-04-06" "1980-04-07" "1980-04-08" "1980-04-09"
## [101] "1980-04-10" "1980-04-11" "1980-04-12" "1980-04-13" "1980-04-14"
## [106] "1980-04-15" "1980-04-16" "1980-04-17" "1980-04-18" "1980-04-19"
## [111] "1980-04-20" "1980-04-21" "1980-04-22" "1980-04-23" "1980-04-24"
## [116] "1980-04-25" "1980-04-26" "1980-04-27" "1980-04-28" "1980-04-29"
## [121] "1980-04-30" "1980-05-01" "1980-05-02" "1980-05-03" "1980-05-04"
## [126] "1980-05-05" "1980-05-06" "1980-05-07" "1980-05-08" "1980-05-09"
## [131] "1980-05-10" "1980-05-11" "1980-05-12" "1980-05-13" "1980-05-14"
## [136] "1980-05-15" "1980-05-16" "1980-05-17" "1980-05-18" "1980-05-19"
## [141] "1980-05-20" "1980-05-21" "1980-05-22" "1980-05-23" "1980-05-24"
## [146] "1980-05-25" "1980-05-26" "1980-05-27" "1980-05-28" "1980-05-29"
## [151] "1980-05-30" "1980-05-31" "1980-06-01" "1980-06-02" "1980-06-03"
## [156] "1980-06-04" "1980-06-05" "1980-06-06" "1980-06-07" "1980-06-08"
## [161] "1980-06-09" "1980-06-10" "1980-06-11" "1980-06-12" "1980-06-13"
## [166] "1980-06-14" "1980-06-15" "1980-06-16" "1980-06-17" "1980-06-18"
## [171] "1980-06-19" "1980-06-20" "1980-06-21" "1980-06-22" "1980-06-23"
## [176] "1980-06-24" "1980-06-25" "1980-06-26" "1980-06-27" "1980-06-28"
## [181] "1980-06-29" "1980-06-30" "1980-07-01" "1980-07-02" "1980-07-03"
## [186] "1980-07-04" "1980-07-05" "1980-07-06" "1980-07-07" "1980-07-08"
## [191] "1980-07-09" "1980-07-10" "1980-07-11" "1980-07-12" "1980-07-13"
## [196] "1980-07-14" "1980-07-15" "1980-07-16" "1980-07-17" "1980-07-18"
## [201] "1980-07-19" "1980-07-20" "1980-07-21" "1980-07-22" "1980-07-23"
## [206] "1980-07-24" "1980-07-25" "1980-07-26" "1980-07-27" "1980-07-28"
## [211] "1980-07-29" "1980-07-30" "1980-07-31" "1980-08-01" "1980-08-02"
## [216] "1980-08-03" "1980-08-04" "1980-08-05" "1980-08-06" "1980-08-07"
## [221] "1980-08-08" "1980-08-09" "1980-08-10" "1980-08-11" "1980-08-12"
## [226] "1980-08-13" "1980-08-14" "1980-08-15" "1980-08-16" "1980-08-17"
## [231] "1980-08-18" "1980-08-19" "1980-08-20" "1980-08-21" "1980-08-22"
## [236] "1980-08-23" "1980-08-24" "1980-08-25" "1980-08-26" "1980-08-27"
## [241] "1980-08-28" "1980-08-29" "1980-08-30" "1980-08-31" "1980-09-01"
## [246] "1980-09-02" "1980-09-03" "1980-09-04" "1980-09-05" "1980-09-06"
## [251] "1980-09-07" "1980-09-08" "1980-09-09" "1980-09-10" "1980-09-11"
## [256] "1980-09-12" "1980-09-13" "1980-09-14" "1980-09-15" "1980-09-16"
## [261] "1980-09-17" "1980-09-18" "1980-09-19" "1980-09-20" "1980-09-21"
## [266] "1980-09-22" "1980-09-23" "1980-09-24" "1980-09-25" "1980-09-26"
## [271] "1980-09-27" "1980-09-28" "1980-09-29" "1980-09-30" "1980-10-01"
## [276] "1980-10-02" "1980-10-03" "1980-10-04" "1980-10-05" "1980-10-06"
## [281] "1980-10-07" "1980-10-08" "1980-10-09" "1980-10-10" "1980-10-11"
## [286] "1980-10-12" "1980-10-13" "1980-10-14" "1980-10-15" "1980-10-16"
## [291] "1980-10-17" "1980-10-18" "1980-10-19" "1980-10-20" "1980-10-21"
## [296] "1980-10-22" "1980-10-23" "1980-10-24" "1980-10-25" "1980-10-26"
## [301] "1980-10-27" "1980-10-28" "1980-10-29" "1980-10-30" "1980-10-31"
## [306] "1980-11-01" "1980-11-02" "1980-11-03" "1980-11-04" "1980-11-05"
## [311] "1980-11-06" "1980-11-07" "1980-11-08" "1980-11-09" "1980-11-10"
## [316] "1980-11-11" "1980-11-12" "1980-11-13" "1980-11-14" "1980-11-15"
## [321] "1980-11-16" "1980-11-17" "1980-11-18" "1980-11-19" "1980-11-20"
## [326] "1980-11-21" "1980-11-22" "1980-11-23" "1980-11-24" "1980-11-25"
## [331] "1980-11-26" "1980-11-27" "1980-11-28" "1980-11-29" "1980-11-30"
## [336] "1980-12-01" "1980-12-02" "1980-12-03" "1980-12-04" "1980-12-05"
## [341] "1980-12-06" "1980-12-07" "1980-12-08" "1980-12-09" "1980-12-10"
## [346] "1980-12-11" "1980-12-12" "1980-12-13" "1980-12-14" "1980-12-15"
## [351] "1980-12-16" "1980-12-17" "1980-12-18" "1980-12-19" "1980-12-20"
## [356] "1980-12-21" "1980-12-22" "1980-12-23" "1980-12-24" "1980-12-25"
## [361] "1980-12-26" "1980-12-27" "1980-12-28" "1980-12-29" "1980-12-30"
## [366] "1980-12-31"
Well done! Now that you know how the full_seq()
function works, you can start plugging it into the complete()
function.
3.2.4 The Cold War’s hottest year
In October 1962, during the Cuban missile crisis,
the world came close to full-scale nuclear war. Throughout 1962, the
USA, USSR, and France together detonated a record 178 nuclear bombs for
military power display and research. You’ve been given a sample of the
Nuclear Explosion Database (NEDB) for that year (cumul_nukes_1962_df
) with an observation for each date
on which a bomb was detonated. The total_bombs
variable contains the cumulative number of bombs detonated by a country up to that point in time.
You’ll complete the dataset to hold the full sequence of dates, and
visualize the total number of bombs per country over time. You’ll also
use the fill()
function from Chapter One to impute missing values.
The dplyr
and ggplot2
packages have been pre-loaded for you.
- Complete the dataset so that for each country there is an observation for each date, using the full_seq() function.
cumul_nukes_1962_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/nukes_1962.rds")
cumul_nukes_1962_df %>%
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1))
## # A tibble: 1,077 × 3
## country date total_bombs
## <chr> <date> <int>
## 1 France 1962-01-01 0
## 2 France 1962-01-02 NA
## 3 France 1962-01-03 NA
## 4 France 1962-01-04 NA
## 5 France 1962-01-05 NA
## 6 France 1962-01-06 NA
## 7 France 1962-01-07 NA
## 8 France 1962-01-08 NA
## 9 France 1962-01-09 NA
## 10 France 1962-01-10 NA
## # … with 1,067 more rows
- Group the data by country.
- Use the fill() function to overwrite NA values in the total_bombs variable with the last known value.
cumul_nukes_1962_df %>%
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1)) %>%
  # Group the data by country
  group_by(country) %>%
  # Impute missing values with the last known observation
  fill(total_bombs)
## # A tibble: 1,077 × 3
## # Groups: country [3]
## country date total_bombs
## <chr> <date> <int>
## 1 France 1962-01-01 0
## 2 France 1962-01-02 0
## 3 France 1962-01-03 0
## 4 France 1962-01-04 0
## 5 France 1962-01-05 0
## 6 France 1962-01-06 0
## 7 France 1962-01-07 0
## 8 France 1962-01-08 0
## 9 France 1962-01-09 0
## 10 France 1962-01-10 0
## # … with 1,067 more rows
- Use ggplot() to visualize total_bombs at any given date, and color the line plot by country. Some code has been added for you to visualize the Cuban Missile Crisis.
cumul_nukes_1962_df %>%
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1)) %>%
  # Group the data by country
  group_by(country) %>%
  # Impute missing values with the last known observation
  fill(total_bombs) %>%
  # Plot the number of bombs over time, color by country
  ggplot(aes(date, total_bombs, color = country)) +
  # These two lines will mark the Cuban Missile Crisis
  geom_rect(xmin = as.Date("1962-10-16"), xmax = as.Date("1962-10-29"), ymin = -Inf, ymax = Inf, color = NA) +
  geom_text(x = as.Date("1962-10-22"), y = 15, label = "Cuban Missile Crisis", angle = 90, color = "white") +
  geom_line()
Good job! It looks like the USSR only started running tests in the second half of the year. The full_seq() function allowed you to add dates not yet seen in the data.
3.3 Advanced completions
3.3.1 Olympic medals per continent
You want to compare Olympic performance of athletes per continent over
time, both on the winter and summer Olympics. You’ve been given a
dataset medal_df
with the average number of medals won per
participant of each continent since 1928. You’ll complete this data to
introduce zero values for years where a continent did not win any
medals.
The ggplot2
package has been pre-loaded for you.
- Complete the dataset so that each continent has a medals_per_participant value at each Olympic event. Missing values should be filled with zeros.
- Nest the season and year variables using the nesting() function, since the summer and winter Olympics don’t occur in the same years.
- Use ggplot() to create a line plot with medals_per_participant per year, and color the plot by continent.
medal_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ7ryL3YFNP4S9bw-F31Ejavx2ZJIBM5yWWVclfVSONfXMeTzaJaIe8yeCkma0FJuudkTUbIbRyWlDz/pub?gid=1081899654&single=true&output=csv")
medal_df %>%
  # Give each continent an observation at each Olympic event
  complete(
    continent, nesting(season, year),
    fill = list(medals_per_participant = 0)
  ) %>%
  # Plot the medals_per_participant over time, colored by continent
  ggplot(aes(x = year, y = medals_per_participant, color = continent)) +
  geom_line() +
  facet_grid(season ~ .)
Well done! Since the nineties, North American athletes have been most proficient, especially in the winter Olympics. Nesting variables while completing is important to avoid adding nonsensical observations to your data, like summer Olympics in years where they did not happen.
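The effect of nesting() is easiest to see on a toy tibble (made-up data, not the course's medal_df): crossing the completed variables invents season-year pairs that never happened, while nesting() keeps only the pairs observed in the data.

```r
library(tidyr)

games <- tibble::tibble(
  season = c("Summer", "Winter"),
  year   = c(2016, 2018),
  n      = c(10, 20)
)

# Crossing season and year invents "Summer 2018" and "Winter 2016": 4 rows
complete(games, season, year)

# nesting() keeps only the season-year pairs seen in the data: 2 rows
complete(games, nesting(season, year))
```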
3.3.2 Tracking a virus outbreak
You’re a doctor in a remote village confronted with a virus outbreak.
You have been collecting data on when your patients got infected and
recovered in a data frame named patient_df
. Your goal is to
create a visual with the number of sick patients over time. You’ll
first have to reshape the data so that you can count the number of sick
patients per day.
The dplyr
and ggplot2
packages have been pre-loaded for you.
- Inspect patient_df in the console.
- Pivot the infected and recovered columns to long format; the old column names should go in the status variable, the values in date.
patient_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ7ryL3YFNP4S9bw-F31Ejavx2ZJIBM5yWWVclfVSONfXMeTzaJaIe8yeCkma0FJuudkTUbIbRyWlDz/pub?gid=1724949310&single=true&output=csv")
patient_df <- patient_df %>% mutate(infected = as.Date(infected), recovered = as.Date(recovered))
patient_df %>%
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date")
## # A tibble: 52 × 3
## patient status date
## <chr> <chr> <date>
## 1 A infected 2020-02-08
## 2 A recovered 2020-03-28
## 3 B infected 2020-02-19
## 4 B recovered 2020-03-04
## 5 C infected 2020-02-27
## 6 C recovered 2020-03-19
## 7 D infected 2020-03-05
## 8 D recovered 2020-03-19
## 9 E infected 2020-03-06
## 10 E recovered 2020-03-29
## # … with 42 more rows
- Group the data by patient, then complete the date column so that each date between infection and recovery is added using the full_seq() function. At the end, ungroup the data.
patient_df %>%
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date") %>%
  select(-status) %>%
  # Group by patient
  group_by(patient) %>%
  # Complete the date range per patient using full_seq()
  complete(date = full_seq(date, period = 1)) %>%
  # Ungroup the data
  ungroup()
## # A tibble: 684 × 2
## patient date
## <chr> <date>
## 1 A 2020-02-08
## 2 A 2020-02-09
## 3 A 2020-02-10
## 4 A 2020-02-11
## 5 A 2020-02-12
## 6 A 2020-02-13
## 7 A 2020-02-14
## 8 A 2020-02-15
## 9 A 2020-02-16
## 10 A 2020-02-17
## # … with 674 more rows
- Each date is now a day on which a patient was sick; count the dates and name the new variable n_sick.
patient_df %>%
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date") %>%
  select(-status) %>%
  # Group by patient
  group_by(patient) %>%
  # Complete the date range per patient using full_seq()
  complete(date = full_seq(date, period = 1)) %>%
  # Ungroup the data
  ungroup() %>%
  # Count the dates, the count goes in the n_sick variable
  count(date, name = "n_sick") %>%
  ggplot(aes(x = date, y = n_sick)) +
  geom_line()
Good job! Combining group_by()
with tidyr
functions further broadens the range of manipulations you can do.
3.3.3 Counting office occupants
Imagine you’re an office facility manager and want to know how many people are present throughout the day. You’ve installed a sensor at the entrance that counts the number of people entering and leaving the building. The sensor sends an update at the end of every 20-minute time slot if at least one person passed.
To create a dataset ready for visualization, you’ll combine the different techniques you’ve learned so far.
The dplyr
and ggplot2
packages have been pre-loaded for you.
- Complete the time variable by using the seq() function to create a sequence between its min and max values with an interval of "20 min". Fill NA values of enter and exit with 0L.
sensor_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ7ryL3YFNP4S9bw-F31Ejavx2ZJIBM5yWWVclfVSONfXMeTzaJaIe8yeCkma0FJuudkTUbIbRyWlDz/pub?gid=1171603308&single=true&output=csv")
sensor_df <- sensor_df %>% mutate(time = as.POSIXct(time))
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L))
## # A tibble: 36 × 4
## time id enter exit
## <dttm> <int> <int> <int>
## 1 2020-01-01 07:00:00 1 1 0
## 2 2020-01-01 07:20:00 2 2 0
## 3 2020-01-01 07:40:00 3 3 -1
## 4 2020-01-01 08:00:00 4 5 0
## 5 2020-01-01 08:20:00 5 3 0
## 6 2020-01-01 08:40:00 6 4 -1
## 7 2020-01-01 09:00:00 7 4 0
## 8 2020-01-01 09:20:00 8 1 0
## 9 2020-01-01 09:40:00 9 1 -1
## 10 2020-01-01 10:00:00 NA 0 0
## # … with 26 more rows
- Calculate the total_inside variable by taking the cumulative sum of enter plus exit using the cumsum() function.
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit))
## # A tibble: 36 × 5
## time id enter exit total_inside
## <dttm> <int> <int> <int> <int>
## 1 2020-01-01 07:00:00 1 1 0 1
## 2 2020-01-01 07:20:00 2 2 0 3
## 3 2020-01-01 07:40:00 3 3 -1 5
## 4 2020-01-01 08:00:00 4 5 0 10
## 5 2020-01-01 08:20:00 5 3 0 13
## 6 2020-01-01 08:40:00 6 4 -1 16
## 7 2020-01-01 09:00:00 7 4 0 20
## 8 2020-01-01 09:20:00 8 1 0 21
## 9 2020-01-01 09:40:00 9 1 -1 21
## 10 2020-01-01 10:00:00 NA 0 0 21
## # … with 26 more rows
- Pivot the enter and exit columns to long format. The column names should go in the direction variable, the values in n_people.
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit)) %>%
  # Pivot the enter and exit columns to long format
  pivot_longer(enter:exit, names_to = "direction", values_to = "n_people")
## # A tibble: 72 × 5
## time id total_inside direction n_people
## <dttm> <int> <int> <chr> <int>
## 1 2020-01-01 07:00:00 1 1 enter 1
## 2 2020-01-01 07:00:00 1 1 exit 0
## 3 2020-01-01 07:20:00 2 3 enter 2
## 4 2020-01-01 07:20:00 2 3 exit 0
## 5 2020-01-01 07:40:00 3 5 enter 3
## 6 2020-01-01 07:40:00 3 5 exit -1
## 7 2020-01-01 08:00:00 4 10 enter 5
## 8 2020-01-01 08:00:00 4 10 exit 0
## 9 2020-01-01 08:20:00 5 13 enter 3
## 10 2020-01-01 08:20:00 5 13 exit 0
## # … with 62 more rows
- Use ggplot() to visualize n_people in the building over time. Use the fill argument to color the area plot by direction.
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit)) %>%
  # Pivot the enter and exit columns to long format
  pivot_longer(enter:exit, names_to = "direction", values_to = "n_people") %>%
  # Plot the number of people over time, fill by direction
  ggplot(aes(x = time, y = n_people, fill = direction)) +
  geom_area() +
  geom_line(aes(y = total_inside))
Good job! You can see the entries and exits in the area plot and the total number of people inside in the black line. Sensor data often comes at high frequencies, so being able to complete timestamps is a powerful skill.
4 Rectangling data
In the final chapter, you’ll learn how to turn nested data structures such as JSON and XML files into tidy, rectangular data. This skill will enable you to process data from web APIs. You’ll also learn how nested data structures can be used to write elegant modeling pipelines that produce tidy outputs.
4.1 Intro to non-rectangular data
4.1.1 Rectangular vs non-rectangular files
You saw a number of file types in the video. Can you remember which store data in a rectangular format and which don’t?
- Group these file types and formats into the right category.
- CSV stands for Comma Separated Values. It’s often used to save raw spreadsheet data without formatting.
- JSON stands for JavaScript Object Notation. It aims to be both human and machine readable.
- XML stands for Extensible Markup Language. It aims to be both human and machine readable.
- Rectangular data: Spreadsheet, CSV
- Non-rectangular data: JSON, XML
Great! Understanding how these file types are different will help you pick the right tools while tidying.
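To get JSON into R in the first place, a common route is the jsonlite package (an assumption here; this course pre-loads the lists for you). With simplifyVector = FALSE, fromJSON() returns exactly the kind of nested list that tidyr can then rectangle:

```r
# jsonlite is assumed to be installed; it is not part of tidyr
library(jsonlite)

json <- '[{"title": "A New Hope", "episode_id": 4},
          {"title": "The Empire Strikes Back", "episode_id": 5}]'

# Keep the nested list structure instead of auto-simplifying
movie_list <- fromJSON(json, simplifyVector = FALSE)

# One list element per movie, ready for unnest_wider()
tidyr::unnest_wider(tibble::tibble(movie = movie_list), movie)
```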
4.1.2 Rectangling Star Wars movies
Let’s pretend you’re a big Star Wars fan and decided to scrape some data from the Star Wars API. You’ve already loaded the JSON-formatted response into R, and now have two lists of movies named movie_list
and movie_planets_list
.
Your goal is to turn these into rectangular data frames with one row
per movie so that you can start crunching those movie stats.
The dplyr
package has been pre-loaded for you.
- Create a tibble with a single column called movie out of the input movie_list.
movie_list <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/star_wars_movie_list.rds")
# Create a movie column from the movie_list
tibble(movie = movie_list)
## # A tibble: 7 × 1
## movie
## <list>
## 1 <named list [5]>
## 2 <named list [5]>
## 3 <named list [5]>
## 4 <named list [5]>
## 5 <named list [5]>
## 6 <named list [5]>
## 7 <named list [5]>
-
Widen the dataset by unnesting the
movie
column over multiple columns.
# Create a movie column from the movie_list
tibble(movie = movie_list) %>%
# Unnest the movie column
unnest_wider(movie)
## # A tibble: 7 × 5
## title episode_id release_date director producer
## <chr> <int> <chr> <chr> <chr>
## 1 A New Hope 4 1977-05-25 George Lucas Gary Kurtz, …
## 2 Attack of the Clones 2 2002-05-16 George Lucas Rick McCallum
## 3 The Phantom Menace 1 1999-05-19 George Lucas Rick McCallum
## 4 Revenge of the Sith 3 2005-05-19 George Lucas Rick McCallum
## 5 Return of the Jedi 6 1983-05-25 Richard Marquand Howard G. Ka…
## 6 The Empire Strikes Back 5 1980-05-17 Irvin Kershner Gary Kurtz, …
## 7 The Force Awakens 7 2015-12-11 J. J. Abrams Kathleen Ken…
- Re-create the tibble with a single column called movie out of movie_planets_list.
movie_planets_list <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/star_wars_movie_planet_list.rds")
# Create a tibble with a movie column
tibble(movie = movie_planets_list) %>%
  # Unnest the movie column
  unnest_wider(movie)
## # A tibble: 7 × 3
## title release_date planets
## <chr> <chr> <list>
## 1 A New Hope 1977-05-25 <chr [3]>
## 2 The Empire Strikes Back 1980-05-17 <chr [4]>
## 3 Return of the Jedi 1983-05-25 <chr [5]>
## 4 The Phantom Menace 1999-05-19 <chr [3]>
## 5 Attack of the Clones 2002-05-16 <chr [5]>
## 6 Revenge of the Sith 2005-05-19 <chr [13]>
## 7 The Force Awakens 2015-12-11 <chr [1]>
-
Unnest the
planets
column to a wider format.
# Create a tibble with a movie column
tibble(movie = movie_planets_list) %>%
# Unnest the movie column
unnest_wider(movie) %>%
# Unnest the planets column
unnest_wider(planets)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
## New names:
## * `` -> ...1
## # A tibble: 7 × 15
## title release_date ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 A Ne… 1977-05-25 Alde… Yavi… Tato… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 The … 1980-05-17 Hoth Dago… Besp… Ord … <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Retu… 1983-05-25 Dago… Endor Naboo Coru… Tato… <NA> <NA> <NA> <NA> <NA>
## 4 The … 1999-05-19 Naboo Coru… Tato… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Atta… 2002-05-16 Naboo Coru… Kami… Geon… Tato… <NA> <NA> <NA> <NA> <NA>
## 6 Reve… 2005-05-19 Alde… Dago… Naboo Coru… Utap… Must… Kash… Poli… Myge… Felu…
## 7 The … 2015-12-11 Jakku <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # … with 3 more variables: ...11 <chr>, ...12 <chr>, ...13 <chr>
Interesting! You’re able to unnest the full tree structure of the movie_planets_list
, but don’t end up with a tidy format. You’ll learn how to fix this in the next lesson.
4.2 From nested values to observations
4.2.1 Unnesting wide or long
In the slides, you learned some rules of thumb on when to use unnest_longer()
or unnest_wider()
to unnest a list column. You’ll now decide which function to use for two data samples.
- Assign these columns to the function that you would unnest them with.
unnest_wider()
metadata
<list>
<named list [3]>
<named list [3]>
<named list [3]>
<named list [3]>
unnest_longer()
films
<list>
<list [4]>
<list [5]>
<list [4]>
<list [3]>
Well done! These are the two situations where it is clear which unnesting function to try: named lists of fixed length = unnest_wider()
, unnamed lists of varying length = unnest_longer()
. There are other situations, like named lists with varying number of elements, where some trial and error will be necessary.
4.2.2 Rectangling Star Wars planets
Let’s finish what we started in the last exercise of the previous lesson, exploring Star Wars planets! The movie_planets_list
scraped from the Star Wars API has been pre-loaded for you. You’ll need two specific unnesting operations to completely rectangle this data.
- Create a tibble with a single column called movie out of movie_planets_list.
# Create a tibble from movie_planets_list
tibble(movie = movie_planets_list)
## # A tibble: 7 × 1
## movie
## <list>
## 1 <named list [3]>
## 2 <named list [3]>
## 3 <named list [3]>
## 4 <named list [3]>
## 5 <named list [3]>
## 6 <named list [3]>
## 7 <named list [3]>
-
Unnest the
movie
list column which contains named lists of equal length.
# Create a tibble from movie_planets_list
tibble(movie = movie_planets_list) %>%
# Unnest the movie column in the correct direction
unnest_wider(movie)
## # A tibble: 7 × 3
## title release_date planets
## <chr> <chr> <list>
## 1 A New Hope 1977-05-25 <chr [3]>
## 2 The Empire Strikes Back 1980-05-17 <chr [4]>
## 3 Return of the Jedi 1983-05-25 <chr [5]>
## 4 The Phantom Menace 1999-05-19 <chr [3]>
## 5 Attack of the Clones 2002-05-16 <chr [5]>
## 6 Revenge of the Sith 2005-05-19 <chr [13]>
## 7 The Force Awakens 2015-12-11 <chr [1]>
-
Unnest the
planets
list column which contains unnamed lists of unequal length.
# Create a tibble from movie_planets_list
tibble(movie = movie_planets_list) %>%
# Unnest the movie column in the correct direction
unnest_wider(movie) %>%
# Unnest the planets column in the correct direction
unnest_longer(planets)
## # A tibble: 34 × 3
## title release_date planets
## <chr> <chr> <chr>
## 1 A New Hope 1977-05-25 Alderaan
## 2 A New Hope 1977-05-25 Yavin IV
## 3 A New Hope 1977-05-25 Tatooine
## 4 The Empire Strikes Back 1980-05-17 Hoth
## 5 The Empire Strikes Back 1980-05-17 Dagobah
## 6 The Empire Strikes Back 1980-05-17 Bespin
## 7 The Empire Strikes Back 1980-05-17 Ord Mantell
## 8 Return of the Jedi 1983-05-25 Dagobah
## 9 Return of the Jedi 1983-05-25 Endor
## 10 Return of the Jedi 1983-05-25 Naboo
## # … with 24 more rows
Nice work! You now have a tidy dataset with one observation for every planet featured in each movie.
4.2.3 The Solar System’s biggest moons
Most planets in our solar system are accompanied by at least one moon.
You now wonder which planets are circled by the biggest moons and want
to create a top five based on moon radius. However, you’ll first have to
unnest the devstronomy project data in planet_df
using the unnest_longer()
and unnest_wider()
functions.
The dplyr
package has been pre-loaded for you.
-
Unnest the
moons
column so that each moon gets an observation.
planet_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/planet_moons_list_df.rds")
planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons)
## # A tibble: 174 × 2
## planet moons
## <chr> <list>
## 1 Mercury <NULL>
## 2 Venus <NULL>
## 3 Earth <named list [2]>
## 4 Jupiter <named list [2]>
## 5 Jupiter <named list [2]>
## 6 Jupiter <named list [2]>
## 7 Jupiter <named list [2]>
## 8 Jupiter <named list [2]>
## 9 Jupiter <named list [2]>
## 10 Jupiter <named list [2]>
## # … with 164 more rows
-
Unnest the
moons
column so that its contents are split over columns.
planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons) %>%
  # Further unnest the moons column
  unnest_wider(moons)
## # A tibble: 174 × 3
## planet moon_name moon_data
## <chr> <chr> <list>
## 1 Mercury <NA> <NULL>
## 2 Venus <NA> <NULL>
## 3 Earth Moon <named list [2]>
## 4 Jupiter Io <named list [2]>
## 5 Jupiter Europa <named list [2]>
## 6 Jupiter Ganymede <named list [2]>
## 7 Jupiter Callisto <named list [2]>
## 8 Jupiter Amalthea <named list [2]>
## 9 Jupiter Himalia <named list [2]>
## 10 Jupiter Elara <named list [2]>
## # … with 164 more rows
- Unnest the moon_data column so that its contents are split over columns.

planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons) %>%
  # Further unnest the moons column
  unnest_wider(moons) %>%
  # Unnest the moon_data column
  unnest_wider(moon_data)
## # A tibble: 174 × 4
## planet moon_name radius density
## <chr> <chr> <dbl> <dbl>
## 1 Mercury <NA> NA NA
## 2 Venus <NA> NA NA
## 3 Earth Moon 1738. 3.34
## 4 Jupiter Io 1822. 3.53
## 5 Jupiter Europa 1561. 3.01
## 6 Jupiter Ganymede 2631. 1.94
## 7 Jupiter Callisto 2410. 1.83
## 8 Jupiter Amalthea 83.4 0.849
## 9 Jupiter Himalia 85 2.6
## 10 Jupiter Elara 43 2.6
## # … with 164 more rows
- Use dplyr's slice_max() function on the moon radius to get a top five of the biggest moons.

planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons) %>%
  # Further unnest the moons column
  unnest_wider(moons) %>%
  # Unnest the moon_data column
  unnest_wider(moon_data) %>%
  # Get the top five largest moons by radius
  slice_max(radius, n = 5)
## # A tibble: 5 × 4
## planet moon_name radius density
## <chr> <chr> <dbl> <dbl>
## 1 Jupiter Ganymede 2631. 1.94
## 2 Saturn Titan 2575. 1.88
## 3 Jupiter Callisto 2410. 1.83
## 4 Jupiter Io 1822. 3.53
## 5 Earth Moon 1738. 3.34
Well done! Jupiter has three moons in the top five with Ganymede being
the largest overall. We also have Saturn’s Titan and our very own Moon.
The more deeply nested your data is, the more iterations of unnest_longer()
and unnest_wider()
you’ll need to fully unnest it.
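To see why deeper nesting needs more unnesting passes, here is a base-R sketch of the same three steps on a tiny hand-built stand-in for the devstronomy data (the planet and moon values below are hypothetical, not the course dataset):

```r
# Two planets, each with a list of moons; each moon is a named list whose
# moon_data element is itself a named list -- two levels of nesting
planet_df <- data.frame(planet = c("Earth", "Mars"))
planet_df$moons <- list(
  list(list(moon_name = "Moon", moon_data = list(radius = 1738, density = 3.34))),
  list(
    list(moon_name = "Phobos", moon_data = list(radius = 11.3, density = 1.87)),
    list(moon_name = "Deimos", moon_data = list(radius = 6.2, density = 1.47))
  )
)

moons_df <- do.call(rbind, lapply(seq_len(nrow(planet_df)), function(i) {
  # unnest_longer(moons): one output row per element of the planet's moons list
  do.call(rbind, lapply(planet_df$moons[[i]], function(m) {
    # unnest_wider(moons), then unnest_wider(moon_data):
    # named fields become their own columns
    data.frame(planet = planet_df$planet[i],
               moon_name = m$moon_name,
               radius = m$moon_data$radius,
               density = m$moon_data$density)
  }))
}))
moons_df
```

Each level of the list structure costs one unnesting step, which is exactly what the unnest_longer()/unnest_wider() pipeline above spells out declaratively.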
4.3 Selecting nested variables
4.3.1 Hoisting Star Wars films
You’ve been given a nested data set on Star Wars characters (character_df
) and want to explore the films in which they appeared.
You’ll first use the unnest_wider()
and unnest_longer()
functions to explore the data and will then switch to hoist()
to select a specific element in the nested data structure directly.
- Inspect character_df in the console.
- Unnest the metadata column.

character_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/star_wars_characters.rds")

character_df %>%
  # Unnest the metadata column
  unnest_wider(metadata)
## # A tibble: 82 × 4
## name species homeworld films
## <chr> <chr> <chr> <list>
## 1 Ackbar Mon Calamari Mon Cala <chr [2]>
## 2 Adi Gallia Tholothian Coruscant <chr [2]>
## 3 Anakin Skywalker Human Tatooine <chr [3]>
## 4 Arvel Crynyd Human unknown <chr [1]>
## 5 Ayla Secura Twi'lek Ryloth <chr [3]>
## 6 Bail Prestor Organa Human Alderaan <chr [2]>
## 7 Barriss Offee Mirialan Mirial <chr [1]>
## 8 BB8 Droid unknown <chr [1]>
## 9 Ben Quadinaros Toong Tund <chr [1]>
## 10 Beru Whitesun lars Human Tatooine <chr [3]>
## # … with 72 more rows
- Unnest the films column.

character_df %>%
  # Unnest the metadata column
  unnest_wider(metadata) %>%
  # Unnest the films column
  unnest_longer(films)
## # A tibble: 166 × 4
## name species homeworld films
## <chr> <chr> <chr> <chr>
## 1 Ackbar Mon Calamari Mon Cala Return of the Jedi
## 2 Ackbar Mon Calamari Mon Cala The Force Awakens
## 3 Adi Gallia Tholothian Coruscant The Phantom Menace
## 4 Adi Gallia Tholothian Coruscant Revenge of the Sith
## 5 Anakin Skywalker Human Tatooine Attack of the Clones
## 6 Anakin Skywalker Human Tatooine The Phantom Menace
## 7 Anakin Skywalker Human Tatooine Revenge of the Sith
## 8 Arvel Crynyd Human unknown Return of the Jedi
## 9 Ayla Secura Twi'lek Ryloth Attack of the Clones
## 10 Ayla Secura Twi'lek Ryloth The Phantom Menace
## # … with 156 more rows
- As an alternative approach, use hoist() to select the first film from the films list nested in the metadata column.

character_df %>%
  hoist(metadata, first_film = list("films", 1))
## # A tibble: 82 × 3
## name first_film metadata
## <chr> <chr> <list>
## 1 Ackbar Return of the Jedi <named list [3]>
## 2 Adi Gallia The Phantom Menace <named list [3]>
## 3 Anakin Skywalker Attack of the Clones <named list [3]>
## 4 Arvel Crynyd Return of the Jedi <named list [3]>
## 5 Ayla Secura Attack of the Clones <named list [3]>
## 6 Bail Prestor Organa Attack of the Clones <named list [3]>
## 7 Barriss Offee Attack of the Clones <named list [3]>
## 8 BB8 The Force Awakens <named list [3]>
## 9 Ben Quadinaros The Phantom Menace <named list [3]>
## 10 Beru Whitesun lars Attack of the Clones <named list [3]>
## # … with 72 more rows
Nice work! You dug into multiple layers of the nested structure to fetch the data you wanted.
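The path list("films", 1) passed to hoist() is just a recipe for walking the nested list: metadata, then the "films" element, then its first entry. A base-R sketch on a hypothetical two-row version of character_df (values invented for illustration) makes that walk explicit:

```r
# Hypothetical mini character_df with a metadata list column
character_df <- data.frame(name = c("Ackbar", "Adi Gallia"))
character_df$metadata <- list(
  list(species = "Mon Calamari",
       films = list("Return of the Jedi", "The Force Awakens")),
  list(species = "Tholothian",
       films = list("The Phantom Menace", "Revenge of the Sith"))
)

# Base-R equivalent of hoist(metadata, first_film = list("films", 1)):
# for each row, step into metadata -> "films" -> element 1
character_df$first_film <- vapply(
  character_df$metadata,
  function(m) m[["films"]][[1]],
  character(1)
)
character_df[, c("name", "first_film")]
```

The difference is that hoist() also removes the plucked element from the remaining nested structure, whereas this sketch leaves metadata untouched.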
4.3.2 Hoisting movie ratings
You’ve written a script to scrape data on your favorite movies from the Open Movie DataBase API. Now you want to process the JSON data to extract the Rotten Tomatoes rating for each movie. You’ve been given a data frame named movie_df
which holds the JSON responses for five movies. You’ll explore this data with unnest_wider()
and unnest_longer()
before switching to hoist()
.
The dplyr
package has been pre-loaded for you.
- Unnest the movie column.

# movie_df %>%
#   # Unnest the movie column
#   unnest_wider(movie)
- Unnest the Ratings column.

# movie_df %>%
#   # Unnest the movie column
#   unnest_wider(movie) %>%
#   select(Title, Year, Ratings) %>%
#   # Unnest the Ratings column
#   unnest_wider(Ratings)
- Use hoist() on the movie column to extract the Title, Year, and Rotten Tomatoes rating. Note that this rating is nested inside the Ratings column.

# movie_df %>%
#   hoist(
#     movie,
#     title = "Title",
#     year = "Year",
#     rating = list("Ratings", "Rotten Tomatoes")
#   )
Well done! Most data scientists try to stay away from nested data as much as possible. That said, you now know how to get your hands dirty when you have to.
4.4 Nesting data for modeling
4.4.1 Tidy model outputs with broom
You’re trying to predict a person’s weight based on their waist circumference and stature (height). To do so you’re using the US army body measurement dataset ANSUR II. The model has already been trained for you using this code:
model <- lm(weight_kg ~ waist_circum_m + stature_m, data = ansur_df)
You will use the broom
package’s glance()
and tidy()
functions in the console to inspect model
outputs in a tidy format.
model = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/ansur_weight_model.rds")

library(broom)
tidy(model)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -128. 1.25 -103. 0
## 2 waist_circum_m 99.8 0.615 162. 0
## 3 stature_m 68.1 0.781 87.2 0
glance(model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.894 0.894 5.09 25638. 0 2 -18485. 36977. 37004.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
What is the R² value?

- 0
- 0.894
- 5.09
What is the standard error on the intercept?

- -128.
- 0.615
- 1.25
Correct!
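What tidy() and glance() return is information base R already computes but scatters across summary(); broom just collects it into tidy tibbles. A sketch on the built-in mtcars data (not the ANSUR II data, which ships with the course) shows where those numbers live:

```r
# Fit a small linear model on built-in data
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(model)

# glance(model)$r.squared corresponds to:
r_squared <- s$r.squared

# tidy(model)'s estimate and std.error columns correspond to the first two
# columns of the base-R coefficient table:
coef_table <- coef(s)  # Estimate, Std. Error, t value, Pr(>|t|) per term
r_squared
```

The advantage of broom is that these values arrive as regular data frame columns, so they slot straight into dplyr pipelines instead of living inside an S3 summary object.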
4.4.2 Nesting tibbles
You’re pre-processing the US army body measurement dataset ANSUR II to train multiple models in a single pipeline. You’ll experiment with the nest()
function to create a list column with nested tibbles containing sub-sets of the data.
The dplyr package has been pre-loaded for you.
- Group the data by army branch and then nest it.

# ansur_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/ANSUR_II.rds")
# ansur_df %>%
#   # Group the data by branch, then nest
#   group_by(branch) %>%
#   nest()
What is the shape of the Combat Arms branch nested dataset?

- 1,312 × 5
- 1,582 × 5
- 3,174 × 5
- Group the data by both branch and sex, then nest it.

# ansur_df %>%
#   # Group the data by branch and sex, then nest
#   group_by(branch, sex) %>%
#   nest()
What is the nested tibble shape for females in the Combat Arms branch?

- 43 × 4
- 687 × 4
- 1,539 × 4
Well done! Now that you know how to nest your data and how to use the broom
package to tidy model outputs, it’s time to put things together.
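Conceptually, nest() groups the rows and tucks each group's remaining columns into a list column of sub-tibbles. Base R's split() captures the same idea; here is a sketch on hypothetical mini data (invented values, not ANSUR II):

```r
# Mini stand-in for the body measurement data
df <- data.frame(
  branch = c("Combat Arms", "Combat Arms", "Combat Support"),
  sex = c("Male", "Female", "Male"),
  weight_kg = c(80, 62, 75)
)

# Roughly what group_by(branch) %>% nest() produces: one sub-data frame
# per group, holding the columns that were not grouped on
nested <- split(df[c("sex", "weight_kg")], df$branch)
sapply(nested, nrow)
```

nest() keeps the result as a single tibble with a list column, which is what lets you keep piping into mutate() and map() afterwards.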
4.4.3 Modeling on nested data frames
You’ll be working on the US Army ANSUR II body measurement dataset, which has been pre-loaded as ansur_df
.
The goal is to nest the data for both sexes so that you can
simultaneously train two linear models, one for each sex. These models
will derive a person’s weight from their stature (height) and waist
circumference. You’ll then unnest the data to inspect the model’s
statistics produced by the glance()
function from the broom
package.
The dplyr
, broom
, and purrr
packages have been pre-loaded for you.
Side note: In the provided code, the purrr
package’s map()
function applies functions on each nested data frame. Check out this package if you like using functions inside pipes!
- Group the data by sex.
- Nest the data.
- Unnest the glanced column.
ansur_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/ANSUR_II.rds")

ansur_df %>%
  # Group the data by sex
  group_by(sex) %>%
  # Nest the data
  nest() %>%
  mutate(
    fit = map(data, function(df) lm(weight_kg ~ waist_circum_m + stature_m, data = df)),
    glanced = map(fit, glance)
  ) %>%
  # Unnest the glanced column
  unnest(glanced)
## # A tibble: 2 × 15
## # Groups: sex [2]
## sex data fit r.squared adj.r.squared sigma statistic p.value df
## <chr> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female <tibble> <lm> 0.822 0.822 4.63 4585. 0 2
## 2 Male <tibble> <lm> 0.873 0.873 5.06 14076. 0 2
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>, nobs <int>
Well done! You’ve trained multiple models and created tidy outputs in a single pipeline making it easy to compare them.
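Stripped of the tidyverse machinery, the nest()/map() pipeline boils down to "fit one model per group, then collect one row of statistics per group". The same idea in base R with split() and lapply(), sketched on the built-in mtcars data since ansur_df ships with the course:

```r
# One linear model per transmission type (am = 0/1), analogous to
# group_by(sex) %>% nest() %>% mutate(fit = map(...))
models <- lapply(split(mtcars, mtcars$am), function(df) {
  lm(mpg ~ wt + hp, data = df)
})

# One statistic per group, as unnest(glanced) lays them out row by row
r2_per_group <- sapply(models, function(m) summary(m)$r.squared)
r2_per_group
```

The tidyr version scales more gracefully: glance() returns a dozen statistics at once, and unnest() keeps them aligned with the grouping columns automatically.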
Congratulations! You did it! You’re now able to reshape data in R using tidyr
. Tweet us your feedback and tell us what you think.