Reshaping Data with tidyr
Jeroen Boeye - DataCamp
Course Description
Data in the wild can be scary: when confronted with a complicated and messy dataset you may find yourself wondering, where do I even start? The tidyr package allows you to wrangle such beasts into nice and tidy datasets. Inaccessible values stored in column names will be put into rows, JSON files will become data frames, and missing values will never go missing again. You’ll practice these techniques on a wide range of messy datasets, learning along the way how many dogs the Soviet Union sent into space and which bird is most popular in New Zealand. With the tidyr package in your tidyverse toolkit, you’ll be able to transform almost any dataset into a tidy format, which will pay off during the rest of your analysis.
1 Tidy data
You’ll be introduced to the concept of tidy data which is central to this course. In the first two lessons, you’ll jump straight into the action by separating messy character columns into tidy variables and observations ready for analysis. In the final lesson, you’ll learn how to overwrite and remove missing values.
1.1 What is tidy data?
1.1.1 Tidy data structure
You saw a number of tidy data concepts in the video. Can you remember how they relate to a rectangular data structure?
- Assign these concepts to their location in a rectangular data frame.
- Rows: Observations
- Columns: Variables
- Cells: Values
Good job! This is the very foundation of tidy data.
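To make that concrete, here is a tiny made-up tibble (not one of the course datasets) in which each row is an observation, each column a variable, and each cell a value:

```r
library(tibble)

# A tidy rectangle: rows = observations, columns = variables, cells = values
pets_df <- tribble(
  ~name,   ~species, ~weight_kg,
  "Laika", "dog",    6,
  "Felix", "cat",    4
)
pets_df
```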
1.1.2 Multiple variables per column
Being a busy person, you don’t want to spend too much time on Netflix, so you decide to crunch some numbers on TV show and movie durations before deciding what to watch.
You’ve managed to obtain a dataset named netflix_df, but its duration column has an issue. It contains strings with both a value and a unit of duration ("min" or "Season").
You’ll tidy this dataset so that each variable gets its own column.
As will always be the case in this course, the tidyr package has been pre-loaded for you.
- Inspect netflix_df by typing its name directly in the R console at the bottom-right and hitting Enter to see what string separates the value from the unit in the duration column.
- Separate the duration column over two variables named value and unit. Pass the string separating the number from the unit to the sep argument.
netflix_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/netflix_duration.rds")

netflix_df %>%
  # Split the duration column into value and unit columns
  separate(duration, into = c("value", "unit"), sep = " ", convert = TRUE)
## # A tibble: 637 × 4
## title type value unit
## <chr> <chr> <int> <chr>
## 1 Article 15 Movie 125 min
## 2 Kill Me If You Dare Movie 100 min
## 3 The Spy TV Show 1 Season
## 4 The World We Make Movie 108 min
## 5 Watchman Movie 93 min
## 6 Mo Gilligan: Momentum Movie 64 min
## 7 Domino Movie 89 min
## 8 TUNA GIRL Movie 90 min
## 9 Bard of Blood TV Show 1 Season
## 10 Dragons: Rescue Riders TV Show 1 Season
## # … with 627 more rows
Good job on creating your first tidy dataset in this course! Notice the convert = TRUE option? It caused the value column to be converted from string to integer.
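To see what convert = TRUE does in isolation, here is a minimal sketch using a made-up two-row tibble (not the course data):

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(duration = c("90 min", "2 Season"))

# Without convert, both new columns remain character
demo_df %>%
  separate(duration, into = c("value", "unit"), sep = " ")

# With convert = TRUE, the value column is parsed to integer
demo_df %>%
  separate(duration, into = c("value", "unit"), sep = " ", convert = TRUE)
```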
1.2 Columns with multiple values
1.2.1 International phone numbers
You work for a multinational company that uses auto-dialer software to
contact its customers. When new customers subscribe online they are
asked for a phone number but they often forget to add the country code
needed for international calls. You were asked to fix this issue in the
database.
You’ve been given a data frame with national numbers and country codes named phone_nr_df. Now you want to combine the country_code and national_number columns to create valid international numbers.
- Use the unite() function to create a new international_number column, using an empty string as the separator.
phone_nr_df <- tribble(~country, ~country_code, ~national_number,
                       "USA", "+1", "2025550117",
                       "United Kingdom", "+44", "1632960924",
                       "Brazil", "+55", "95552452220",
                       "Australia", "+61", "1900654321",
                       "China", "+86", "13555953217",
                       "India", "+91", "8555843898")

phone_nr_df %>%
  # Unite the country_code and national_number columns
  unite("international_number", country_code, national_number, sep = "")
## # A tibble: 6 × 2
## country international_number
## <chr> <chr>
## 1 USA +12025550117
## 2 United Kingdom +441632960924
## 3 Brazil +5595552452220
## 4 Australia +611900654321
## 5 China +8613555953217
## 6 India +918555843898
Well done! Being able to unite columns is a good skill to have, but in practice you’ll be separating them more often while tidying data.
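Two unite() options worth knowing, shown on a small hypothetical tibble (not the course data): remove = FALSE keeps the source columns, and na.rm = TRUE skips missing pieces instead of pasting the literal text "NA" into the result.

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(code = c("+1", NA), number = c("2025550117", "1632960924"))

# remove = FALSE keeps the original columns alongside the united one
demo_df %>%
  unite("intl", code, number, sep = "", remove = FALSE)

# na.rm = TRUE skips NA pieces so the missing code doesn't become the text "NA"
demo_df %>%
  unite("intl", code, number, sep = "", na.rm = TRUE)
```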
1.2.2 Extracting observations from values
You’re given a sample of the Netflix dataset containing TV shows and their casts called tvshow_df. You want to learn which six actors have the most appearances. However, the dataset only has one row per TV show, and multiple actors are listed in the cast column.
Transform the data so that for each TV show, every actor has a row. The number of appearances will be calculated for you.
The dplyr package has been pre-loaded for you.
tvshow_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/netflix_series.rds")
tvshow_df
## # A tibble: 1,759 × 2
## title cast
## <chr> <chr>
## 1 Transformers Prime Peter Cullen, Sumalee Montano, Frank Welker…
## 2 Transformers: Robots in Disguise Will Friedle, Darren Criss, Constance Zimme…
## 3 Apaches Alberto Ammann, Eloy Azorín, Verónica Echeg…
## 4 Castle of Stars Chaiyapol Pupart, Jintanutda Lummakanon, Wo…
## 5 Archibald's Next Big Thing Tony Hale, Rosamund Pike, Jordan Fisher, Ch…
## 6 The Spy Sacha Baron Cohen, Noah Emmerich, Hadar Rat…
## 7 No Tomorrow Joshua Sasse, Tori Anderson, Jonathan Langd…
## 8 Frequency Peyton List, Riley Smith, Mekhi Phifer, Dev…
## 9 Adam Ruins Everything Adam Conover, Adam Lustick, Emily Axford
## 10 Ben 10 Tara Strong, Montse Hernandez, David Kaye, …
## # … with 1,749 more rows
Inspect tvshow_df in the console. What string separates actors in the cast column?
- " "
- ","
- ", "

- Use separate_rows() on the cast column, using the appropriate separator for the sep argument.
- Use the head() function to keep just the top six.
tvshow_df %>%
  # Separate the actors in the cast column over multiple rows
  separate_rows(cast, sep = ", ") %>%
  rename(actor = cast) %>%
  count(actor, sort = TRUE) %>%
  head()
## # A tibble: 6 × 2
## actor n
## <chr> <int>
## 1 Takahiro Sakurai 18
## 2 Yuki Kaji 16
## 3 Daisuke Ono 14
## 4 David Attenborough 14
## 5 Ashleigh Ball 12
## 6 Hiroshi Kamiya 12
Well done! Once each actor has an observation per TV show, getting the top six is easy. Interestingly, these are all voice actors or narrators.
1.2.3 Separating into columns and rows
Remember the drink ingredients data from the video? You’ve been given a similar version (drink_df) that also includes quantities and units. Now you want to create an overview of how much of each ingredient you should buy to make these drinks.
The dplyr package has been pre-loaded for you.
- Inspect drink_df in the console to find the right separator in the ingredients column.
- Separate the ingredients column so that for each drink each ingredient gets a row.
drink_df <- tribble(~drink, ~ingredients,
                    "Chocolate milk", "milk 0.3 L; chocolate 40 g; sugar 10 g",
                    "Orange juice", "oranges 3; sugar 20 g",
                    "Cappuccino", "milk 0.1 L; water 0.1 L; coffee 30 g; sugar 5 g")

drink_df %>%
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ")
## # A tibble: 9 × 2
## drink ingredients
## <chr> <chr>
## 1 Chocolate milk milk 0.3 L
## 2 Chocolate milk chocolate 40 g
## 3 Chocolate milk sugar 10 g
## 4 Orange juice oranges 3
## 5 Orange juice sugar 20 g
## 6 Cappuccino milk 0.1 L
## 7 Cappuccino water 0.1 L
## 8 Cappuccino coffee 30 g
## 9 Cappuccino sugar 5 g
- Inspect the output of the previous step to find the separator that splits the ingredients column into three columns: ingredient, quantity, and unit.
- Make sure to convert data types to numeric when possible.
drink_df %>%
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ") %>%
  # Separate ingredients into three columns
  separate(
    ingredients, into = c("ingredient", "quantity", "unit"),
    sep = " ",
    convert = TRUE
  )
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
## # A tibble: 9 × 4
## drink ingredient quantity unit
## <chr> <chr> <dbl> <chr>
## 1 Chocolate milk milk 0.3 L
## 2 Chocolate milk chocolate 40 g
## 3 Chocolate milk sugar 10 g
## 4 Orange juice oranges 3 <NA>
## 5 Orange juice sugar 20 g
## 6 Cappuccino milk 0.1 L
## 7 Cappuccino water 0.1 L
## 8 Cappuccino coffee 30 g
## 9 Cappuccino sugar 5 g
- Group the data by ingredient and unit.
- Calculate the total quantity of each ingredient.
drink_df %>%
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ") %>%
  # Separate ingredients into three columns
  separate(
    ingredients, into = c("ingredient", "quantity", "unit"),
    sep = " ",
    convert = TRUE
  ) %>%
  # Group by ingredient and unit
  group_by(ingredient, unit) %>%
  # Calculate the total quantity of each ingredient
  summarize(quantity = sum(quantity))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
## `summarise()` has grouped output by 'ingredient'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: ingredient [6]
## ingredient unit quantity
## <chr> <chr> <dbl>
## 1 chocolate g 40
## 2 coffee g 30
## 3 milk L 0.4
## 4 oranges <NA> 3
## 5 sugar g 35
## 6 water L 0.1
Good job! You separated a messy string column over both rows and columns and unlocked the data within. Note that the separate() function correctly added a missing value for the oranges unit.
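If you want to silence the "Expected 3 pieces" warning, separate() has a fill argument for exactly this case; a quick sketch on a hypothetical two-row tibble:

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(ingredients = c("milk 0.3 L", "oranges 3"))

# fill = "right" pads short strings with NA on the right, without a warning
demo_df %>%
  separate(ingredients, into = c("ingredient", "quantity", "unit"),
           sep = " ", convert = TRUE, fill = "right")
```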
1.3 Missing values
1.3.1 And the Oscar for best director goes to …
You’re working on a sample of the Netflix dataset pre-loaded as director_df. This time, the data frame contains just the directors and movie titles. Your goal is to identify the directors who created the most movies.
Since the director column contains multiple names, you’ll first separate its values over multiple rows and then count the directors.
Since you don’t want movies without directors polluting your overview, you’ll apply the drop_na() function.
The dplyr package has been pre-loaded for you.
- Inspect director_df in the console to see what string separates directors in the director column.
- Spread the values in the director column over separate rows.
director_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/netflix_directors.rds")

director_df %>%
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ")
## # A tibble: 4,815 × 2
## director title
## <chr> <chr>
## 1 Richard Finn Norm of the North: King Sized Adventure
## 2 Tim Maltby Norm of the North: King Sized Adventure
## 3 <NA> Jandino: Whatever it Takes
## 4 Fernando Lebrija #realityhigh
## 5 Gabe Ibáñez Automata
## 6 Rodrigo Toro Fabrizio Copano: Solo pienso en mi
## 7 Francisco Schultz Fabrizio Copano: Solo pienso en mi
## 8 Henrik Ruben Genz Good People
## 9 José Miguel Contreras Joaquín Reyes: Una y no más
## 10 Daniel Alfredson Kidnapping Mr. Heineken
## # … with 4,805 more rows
- Count the number of times each director appears in the data. Make sure to sort the output.
director_df %>%
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ") %>%
  # Count the number of movies per director
  count(director, sort = TRUE)
## # A tibble: 3,536 × 2
## director n
## <chr> <int>
## 1 <NA> 128
## 2 Jan Suter 21
## 3 Raúl Campos 19
## 4 Jay Karas 14
## 5 Marcus Raboy 14
## 6 Jay Chapman 12
## 7 Martin Scorsese 9
## 8 Steven Spielberg 9
## 9 David Dhawan 8
## 10 Johnnie To 8
## # … with 3,526 more rows
- Drop rows containing NA values in the director column.
director_df %>%
  # Drop rows with NA values in the director column
  drop_na(director) %>%
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ") %>%
  # Count the number of movies per director
  count(director, sort = TRUE)
## # A tibble: 3,535 × 2
## director n
## <chr> <int>
## 1 Jan Suter 21
## 2 Raúl Campos 19
## 3 Jay Karas 14
## 4 Marcus Raboy 14
## 5 Jay Chapman 12
## 6 Martin Scorsese 9
## 7 Steven Spielberg 9
## 8 David Dhawan 8
## 9 Johnnie To 8
## 10 Lance Bangs 8
## # … with 3,525 more rows
Good job! Missing values will often show up when you sort values by how often they appear in the data. Expect them to be there and remove them early on!
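The difference between dropping and imputing is easy to see on a small hypothetical tibble (made-up names, not the course data):

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(director = c("A. Director", NA, "B. Director"),
                  title    = c("Movie 1", "Movie 2", "Movie 3"))

# drop_na() removes rows with NA in the given column(s)
demo_df %>% drop_na(director)

# replace_na() keeps all rows and overwrites the NA instead
demo_df %>% replace_na(list(director = "Unknown"))
```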
1.3.2 Imputing sales data
You’ve been asked to create a report that allows management to compare sales figures per quarter for two years.
The problem is that the dataset (sales_df) contains missing values. You’ll need to impute the values in the year column so that you can visualize the data.
The ggplot2 package has been pre-loaded for you.
- Inspect sales_df in the console, paying attention to the year column.
- Use the fill() function to impute the year column in the correct direction.
- Create a line plot where each year has a different color.
sales_df <- tribble(~year, ~quarter, ~sales,
                    NA,     "Q1",    12498,
                    NA,     "Q2",    20461,
                    NA,     "Q3",    19737,
                    "2019", "Q4",    20314,
                    NA,     "Q1",    13494,
                    NA,     "Q2",    19314,
                    NA,     "Q3",    23640,
                    "2020", "Q4",    22920)
sales_df <- sales_df %>% mutate(year = as.factor(year))

sales_df %>%
  # Impute the year column
  fill(year, .direction = "up") %>%
  # Create a line plot with sales per quarter colored by year
  ggplot(aes(x = quarter, y = sales, color = year, group = year)) +
  geom_line()
Alright! Without imputing the year column first, the visual would have been all messed up.
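fill()'s .direction argument controls which way values flow; here is a minimal sketch on a made-up column (the course data needed "up" because each year sits in the Q4 row, below its NAs):

```r
library(tidyr)
library(tibble)
library(dplyr)

demo_df <- tibble(year = c(NA, NA, "2019", NA, "2020"))

# "up": each NA takes the next non-missing value below it
demo_df %>% fill(year, .direction = "up")

# "down" (the default): each NA takes the last non-missing value above it
demo_df %>% fill(year, .direction = "down")
```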
1.3.3 Nuclear bombs per continent
Since WWII, a number of nations have been detonating nuclear bombs for military research. A tally of bombs detonated per nation has been calculated from the Nuclear Explosion DataBase (NEDB) and provided as nuke_df. You are interested in finding out how many bombs have been detonated by nations grouped per continent. To achieve this goal, nuke_df will be joined to country_to_continent_df, which is a mapping of nation to continent. You will need to overwrite missing values with zeros so that you can create a nice plot.
The dplyr and ggplot2 packages have been pre-loaded for you.
Side note 1: Bombs detonated by the Soviet Union were attributed to the Russian Federation.
Side note 2: The Russian Federation is solely mapped to Europe for simplicity.
- Inspect nuke_df and country_to_continent_df in the console.
- Replace the missing values in the n_bombs column with 0L. Adding the L sets the data type to integer.
country_to_continent_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRPCvCEPOCd5KrjlVF_iXe2oDxIiXM_2wfoJzp-Za-5PusRoqUy3dkYJ22HCyyrBmT1KMha82C_SPeP/pub?gid=1254696327&single=true&output=csv")
nuke_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=0&single=true&output=csv")

country_to_continent_df %>%
  left_join(nuke_df, by = "country_code") %>%
  # Impute the missing values in the n_bombs column with 0L
  replace_na(list(n_bombs = 0L))
##     continent country_code                                 country_name n_bombs
## 1        Asia          AFG             Afghanistan, Islamic Republic of       0
## 2      Europe          ALB                         Albania, Republic of       0
## 3  Antarctica          ATA Antarctica (the territory South of 60 deg S)       0
## 4      Africa          DZA     Algeria, People's Democratic Republic of       0
## 5     Oceania          ASM                               American Samoa       0
## 6      Europe          AND                     Andorra, Principality of       0
## (output truncated: 258 rows in total; the non-zero n_bombs entries further down are CHN 45, FRA 200, IND 6, PRK 6, PAK 2, RUS 726, GBR 21, and USA 1150)
- Group the dataset by continent and aggregate the data by summing the number of bombs.
country_to_continent_df %>%
  left_join(nuke_df, by = "country_code") %>%
  # Impute the missing values in the n_bombs column with 0L
  replace_na(list(n_bombs = 0L)) %>%
  # Group the dataset by continent
  group_by(continent) %>%
  # Sum the number of bombs per continent
  summarize(n_bombs_continent = sum(n_bombs))
## # A tibble: 8 × 2
## continent n_bombs_continent
## <chr> <int>
## 1 Africa 0
## 2 Antarctica 0
## 3 Asia 59
## 4 Europe 947
## 5 North America 1150
## 6 Oceania 0
## 7 Oceania > 0
## 8 South America 0
- Plot the summed number of bombs detonated by nations from each continent.
country_to_continent_df %>%
  left_join(nuke_df, by = "country_code") %>%
  # Impute the missing values in the n_bombs column with 0L
  replace_na(list(n_bombs = 0L)) %>%
  # Group the dataset by continent
  group_by(continent) %>%
  # Sum the number of bombs per continent
  summarize(n_bombs_continent = sum(n_bombs)) %>%
  # Plot the number of bombs per continent
  ggplot(aes(x = continent, y = n_bombs_continent)) +
  geom_col()
Kaboom! You just finished Chapter 1! Because you replaced missing values with zeros for non-nuclear powers, all continents show up in the plot.
2 From wide to long and back
This chapter is all about pivoting data from a wide to long format and back again using the pivot_longer() and pivot_wider() functions. You’ll need these functions when variables are hidden in messy column names or when variables are stored in rows instead of columns. You’ll learn about space dogs, nuclear bombs, and planet temperatures along the way.
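Before diving into the exercises, here is a minimal round-trip sketch on a made-up table (not the course data) showing that pivot_wider() undoes pivot_longer():

```r
library(tidyr)
library(tibble)
library(dplyr)

wide_df <- tibble(year = c(1945, 1946),
                  `United States` = c(3L, 2L),
                  France = c(0L, 0L))

# Wide to long: column names become values of a new "country" variable
long_df <- wide_df %>%
  pivot_longer(-year, names_to = "country", values_to = "n_bombs")

# Long back to wide: reverses the operation
long_df %>%
  pivot_wider(names_from = country, values_from = n_bombs)
```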
2.1 From wide to long data
2.1.1 Nuclear bombs per country
You’ve been given a version of the Nuclear Explosion DataBase (NEDB) where country names are specified in the column headers (nuke_df). You want to visualize how many nukes were detonated per year per country. You’ll need to pivot the data and replace NA values first.
The ggplot2 package has been pre-loaded for you.
“A-Day” First atomic bomb explosion at Bikini in the Marshall Islands 1 July 1946.
- Pivot all columns except for year to a longer format.
nuke_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/nuke_wide_country.rds")

nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(-year)
## # A tibble: 584 × 3
## year name value
## <dbl> <chr> <int>
## 1 1945 United States 3
## 2 1945 Russian Federation NA
## 3 1945 United Kingdom NA
## 4 1945 France NA
## 5 1945 China NA
## 6 1945 India NA
## 7 1945 Pakistan NA
## 8 1945 North Korea NA
## 9 1946 United States 2
## 10 1946 Russian Federation NA
## # … with 574 more rows
-
The country names are now in the
name
column. Overwrite its name withcountry
. Thevalue
column should be namedn_bombs
.
nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(
    -year,
    # Overwrite the names of the two new columns
    names_to = "country",
    values_to = "n_bombs"
  )
## # A tibble: 584 × 3
## year country n_bombs
## <dbl> <chr> <int>
## 1 1945 United States 3
## 2 1945 Russian Federation NA
## 3 1945 United Kingdom NA
## 4 1945 France NA
## 5 1945 China NA
## 6 1945 India NA
## 7 1945 Pakistan NA
## 8 1945 North Korea NA
## 9 1946 United States 2
## 10 1946 Russian Federation NA
## # … with 574 more rows
-
Replace the
NA
values in then_bombs
column with integer zero values (0L
).
nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(
    -year,
    # Overwrite the names of the two new columns
    names_to = "country",
    values_to = "n_bombs"
  ) %>%
  # Replace NA values for n_bombs with 0L
  replace_na(list(n_bombs = 0L))
## # A tibble: 584 × 3
## year country n_bombs
## <dbl> <chr> <int>
## 1 1945 United States 3
## 2 1945 Russian Federation 0
## 3 1945 United Kingdom 0
## 4 1945 France 0
## 5 1945 China 0
## 6 1945 India 0
## 7 1945 Pakistan 0
## 8 1945 North Korea 0
## 9 1946 United States 2
## 10 1946 Russian Federation 0
## # … with 574 more rows
- Create a line plot where the number of bombs dropped per country is plotted over time. Use country to color the lines.
nuke_df %>%
  # Pivot the data to a longer format
  pivot_longer(
    -year,
    # Overwrite the names of the two new columns
    names_to = "country",
    values_to = "n_bombs"
  ) %>%
  # Replace NA values for n_bombs with 0L
  replace_na(list(n_bombs = 0L)) %>%
  # Plot the number of bombs per country over time
  ggplot(aes(x = year, y = n_bombs, color = country)) +
  geom_line()
Good job! You can see the Cold War nuclear arms race heat up in the
sixties and then gradually wind down. Knowing when and how to combine
different tidyr
functions like pivot_longer()
and replace_na()
is an important skill best learned through practice.
2.1.2 WHO obesity per country
According to the World Health Organization (WHO), worldwide obesity has
nearly tripled since 1975. You’re interested in the severity of this
global health issue per country and whether males and females are
affected differently. You’ll use the WHO’s obesity data (obesity_df
)
to investigate this issue. The data holds the percentage of females,
males, and both sexes combined that are considered obese (BMI > 30)
per country.
You want to create a scatterplot where, per nation, you can see the obesity data colored differently for females and males. This implies that sex should become a variable with its own column.
The ggplot2
package has been pre-loaded for you.
-
Inspect
obesity_df
in the console. -
Pivot the
male
andfemale
columns. The old column names should go in thesex
column, the original values should go in thepct_obese
column.
obesity_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=278516960&single=true&output=csv")
obesity_df %>%
  # Pivot the male and female columns
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "pct_obese")
## # A tibble: 380 × 4
## country both_sexes sex pct_obese
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 5.5 male 3.2
## 2 Afghanistan 5.5 female 7.6
## 3 Albania 21.7 male 21.6
## 4 Albania 21.7 female 21.8
## 5 Algeria 27.4 male 19.9
## 6 Algeria 27.4 female 34.9
## 7 Andorra 25.6 male 25.9
## 8 Andorra 25.6 female 25.3
## 9 Angola 8.2 male 4
## 10 Angola 8.2 female 12.1
## # … with 370 more rows
-
Create a scatterplot with
pct_obese
percountry
colored bysex
. Thecountry
variable has been ordered by overall obesity % and added for you.
obesity_df %>%
  # Pivot the male and female columns
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "pct_obese") %>%
  # Create a scatter plot with pct_obese per country colored by sex
  ggplot(aes(x = pct_obese, color = sex,
             y = forcats::fct_reorder(country, both_sexes))) +
  geom_point() +
  scale_y_discrete(breaks = c("India", "Nauru", "Cuba", "Brazil",
                              "Pakistan", "Gabon", "Italy", "Oman",
                              "China", "United States of America")) +
  labs(x = "% Obese", y = "Country")
## Warning: Removed 4 rows containing missing values (geom_point).
Well done! There is a huge spread in obesity prevalence over countries.
Overall, females are affected more often. Because you put the sex
variable in its own column it became easy to add it to a ggplot
.
2.1.3 Bond… James Bond
You’ve been given a James Bond movie dataset (bond_df
) and
want to visualize the number of movies that Bond actors have featured in
per decade. However, the data is in an untidy format with the decade
values captured in the column headers. You’ll tidy this dataset to give
each variable its own column.
The ggplot2
package has been pre-loaded for you.
-
Pivot all columns except
Bond
to a longer format and set the names of the newly created columns todecade
andn_movies
.
bond_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=595802601&single=true&output=csv")
colnames(bond_df) = c("Bond", 1960, 1970, 1980, 1990, 2000, 2010, 2020)
bond_df %>%
  # Pivot the data to long format and set the column names
  pivot_longer(
    -Bond,
    names_to = "decade",
    values_to = "n_movies"
  )
## # A tibble: 49 × 3
## Bond decade n_movies
## <chr> <chr> <int>
## 1 Sean Connery 1960 5
## 2 Sean Connery 1970 1
## 3 Sean Connery 1980 NA
## 4 Sean Connery 1990 NA
## 5 Sean Connery 2000 NA
## 6 Sean Connery 2010 NA
## 7 Sean Connery 2020 NA
## 8 David Niven 1960 1
## 9 David Niven 1970 NA
## 10 David Niven 1980 NA
## # … with 39 more rows
-
Drop any
NA
values in then_movies
column while it is created.
bond_df %>%
  # Pivot the data to long format
  pivot_longer(
    -Bond,
    # Overwrite the names of the two newly created columns
    names_to = "decade",
    values_to = "n_movies",
    # Drop NA values
    values_drop_na = TRUE
  )
## # A tibble: 12 × 3
## Bond decade n_movies
## <chr> <chr> <int>
## 1 Sean Connery 1960 5
## 2 Sean Connery 1970 1
## 3 David Niven 1960 1
## 4 George Lazenby 1960 1
## 5 Roger Moore 1970 4
## 6 Roger Moore 1980 3
## 7 Timothy Dalton 1980 2
## 8 Pierce Brosnan 1990 3
## 9 Pierce Brosnan 2000 1
## 10 Daniel Craig 2000 2
## 11 Daniel Craig 2010 2
## 12 Daniel Craig 2020 1
-
Transform the
decade
column data type to integer.
bond_df %>%
  # Pivot the data to long format
  pivot_longer(
    -Bond,
    # Overwrite the names of the two newly created columns
    names_to = "decade",
    values_to = "n_movies",
    # Drop NA values
    values_drop_na = TRUE,
    # Transform the decade column data type to integer
    names_transform = list(decade = as.integer)
  ) %>%
  ggplot(aes(x = decade + 5, y = n_movies, fill = Bond)) +
  geom_col()
Well done! If you’ve worked with tidyr
in the past, you may remember the gather()
function. This is a now retired predecessor of the pivot_longer()
function.
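For comparison, here is the same pivot written both ways on a hypothetical one-row slice of the Bond data; the gather() call still runs for backward compatibility, but new code should use pivot_longer():

```r
library(tidyr)

# One-row toy version of bond_df (hypothetical values)
df <- tibble::tibble(Bond = "Sean Connery", `1960` = 5L, `1970` = 1L)

# Retired syntax, kept for backward compatibility
gather(df, key = "decade", value = "n_movies", -Bond)

# Modern equivalent
pivot_longer(df, -Bond, names_to = "decade", values_to = "n_movies")
```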
2.2 Deriving variables from column headers
2.2.1 New-Zealand’s bird of the year
Every year New Zealanders vote en masse to decide which species gets the
bird of the year trophy. The contest is organized by the Forest & Bird agency
which allows each person to give points to up to five birds (first pick
gets 5 points, second 4, …). Your job is to decide this year’s winner
from the messy dataset that’s been pre-loaded for you as bird_df
.
The dplyr
package has been pre-loaded for you.
-
Inspect
bird_df
in the console. -
Pivot
bird_df
to longer format so that an integer columnpoints
and a character columnspecies
are created. Use thenames_prefix
argument to clean up thepoints
column and make sure noNA
values remain.
bird_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQi_yI_hSJxlAZpWoqIrp8F9b-g6PLT0snajIporAW2hw8beu90XVaoL_SgfmZ0NAmSRqJyHcCNghPO/pub?gid=1305911804&single=true&output=csv")
bird_df %>%
  # Pivot the data to create a two column data frame
  pivot_longer(
    starts_with("points_"),
    names_to = "points",
    names_prefix = "points_",
    names_transform = list(points = as.integer),
    values_to = "species",
    values_drop_na = TRUE
  )
## # A tibble: 40,000 × 3
## point_3 points species
## <chr> <int> <chr>
## 1 Kākā 5 "Gibson's Albatross"
## 2 Kākā 4 "Tūī"
## 3 Kākā 2 "Kākāpō"
## 4 Kākā 1 "Little Spotted Kiwi"
## 5 Weka 5 "Spotted Shag"
## 6 Weka 4 "Fantail"
## 7 Weka 2 ""
## 8 Weka 1 ""
## 9 Little Spotted Kiwi 5 "Great Spotted Kiwi"
## 10 Little Spotted Kiwi 4 "Kea"
## # … with 39,990 more rows
-
Calculate the
total_points
each species got.
bird_df %>%
  # Pivot the data to create a two column data frame
  pivot_longer(
    starts_with("points_"),
    names_to = "points",
    names_prefix = "points_",
    names_transform = list(points = as.integer),
    values_to = "species",
    values_drop_na = TRUE
  ) %>%
  group_by(species) %>%
  summarize(total_points = sum(points)) %>%
  slice_max(total_points, n = 5)
## # A tibble: 5 × 2
## species total_points
## <chr> <int>
## 1 Kākāpō 7675
## 2 Yellow-eyed penguin 6554
## 3 Black Robin 4902
## 4 Kererū 4264
## 5 Banded Dotterel 4162
Good job! A well-deserved win for the Kākāpō!
2.2.2 Big tech stock prices
You’re an analyst at an investment firm and want to visualize the weekly closing prices of five big tech firms’ stocks.
However, the dataset you’ve been handed (stock_df
) is messy and has the year and week variables stored in the column headers.
You’ll pivot this data into a tidy format, extract the variables from the headers, and create a line plot.
The ggplot2
package has been pre-loaded for you.
-
Inspect
stock_df
in the console. -
Pivot
stock_df
so that the integer columnsyear
andweek
are created from the column names and the original values are moved to theprice
column. Use thenames_sep
argument to separate the column names.
stock_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRpPo4a7Uj9jAtdmEC3p1YixmewdMQKKsKbuu3CoF4COKX7j-y0FpiRzxLGdaWVmb7mrvWOWEaph558/pub?gid=1049805716&single=true&output=csv")
colnames(stock_df) = c("company", paste0(2019, "_week", 1:52), paste0(2020, "_week", 1:53))
stock_df %>%
  # Pivot the data to create 3 new columns: year, week, price
  pivot_longer(
    -company,
    names_to = c("year", "week"),
    values_to = "price",
    names_sep = "_week",
    names_transform = list(
      year = as.integer,
      week = as.integer)
  )
## # A tibble: 525 × 4
## company year week price
## <chr> <int> <int> <dbl>
## 1 Amazon 2019 1 1848.
## 2 Amazon 2019 2 1641.
## 3 Amazon 2019 3 1696.
## 4 Amazon 2019 4 1671.
## 5 Amazon 2019 5 1626.
## 6 Amazon 2019 6 1588.
## 7 Amazon 2019 7 1608.
## 8 Amazon 2019 8 1632.
## 9 Amazon 2019 9 1672.
## 10 Amazon 2019 10 1621.
## # … with 515 more rows
-
Create a line plot where the
price
is shown perweek
and color bycompany
. Theyear
variable has been dealt with for you.
stock_df %>%
  # Pivot the data to create 3 new columns: year, week, price
  pivot_longer(
    -company,
    names_to = c("year", "week"),
    values_to = "price",
    names_sep = "_week",
    names_transform = list(
      year = as.integer,
      week = as.integer)
  ) %>%
  # Create a line plot with price per week, color by company
  ggplot(aes(x = week, y = price, color = company)) +
  geom_line() +
  facet_grid(. ~ year)
Well done! You can now extract multiple variables from column headers. I hope you are starting to see just how powerful the pivot_longer()
function is!
2.3 Deriving variables from complex column headers
2.3.1 Soviet space dogs, the dog perspective
You’ll be working on a pre-processed sample of the USSR space dogs database compiled by Duncan Geere and pre-loaded for you as space_dogs_df
. Each of the 42 rows in this dataset represents a test rocket launch which had one or two very brave dogs on board.
Your goal is to reshape this dataset so that for each launch, each dog has a row.
The challenge is that in the column headers (name_1
, name_2
, gender_1
, and gender_2
), the part before the _
separator can point to two different variables (name and gender), while the second part always points to the dog ID (1st or 2nd dog).
Laika, the first animal to orbit Earth in 1957
-
As the first argument to
pivot_longer()
, pass the columns to pivot (name_1
,name_2
,gender_1
, andgender_2
). -
Complete the
names_to
argument so that the first part of the column headers are reused. -
Make sure
NA
values are dropped since not all rockets had two dogs.
space_dogs_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrv7ECpC0Jwv5u4Y5KZFZOGJ2SnEruSs7xoGs_ZPLbLS5CI32JVRLAxpv0dyneZovOOQ_jyaqcZeoh/pub?gid=0&single=true&output=csv")
space_dogs_df %>%
  pivot_longer(
    # Add the columns to pivot
    name_1:gender_2,
    names_sep = "_",
    # Complete the names_to argument to re-use the first part of the column headers
    names_to = c(".value", "dog_id"),
    # Make sure NA values are dropped
    values_drop_na = TRUE
  )
## # A tibble: 84 × 5
## date result dog_id name gender
## <chr> <chr> <chr> <chr> <chr>
## 1 1966-02-22 recovered safely after a 22-day flight 1 "Ugo… "Male"
## 2 1966-02-22 recovered safely after a 22-day flight 2 "Vet… "Male"
## 3 1961-03-25 one orbit, recovered safely 1 "Zve… "Fema…
## 4 1961-03-25 one orbit, recovered safely 2 "" ""
## 5 1961-03-09 one orbit, recovered safely 1 "Che… "Fema…
## 6 1961-03-09 one orbit, recovered safely 2 "" ""
## 7 1960-12-22 upper stage failed, dogs recovered after a su… 1 "Shu… "Fema…
## 8 1960-12-22 upper stage failed, dogs recovered after a su… 2 "Kom… "Fema…
## 9 1960-12-01 spent one day in orbit, capsule was destroyed… 1 "Mus… "Fema…
## 10 1960-12-01 spent one day in orbit, capsule was destroyed… 2 "Pch… "Fema…
## # … with 74 more rows
Well done! Notice how the “name”
and “gender”
parts of the old column headers were reused in two new columns. The
second part of the old headers is found as a variable in the dog_id
column. Since most launches had two dogs, the number of rows almost doubled.
2.3.2 WHO obesity vs. life expectancy
You’ve been given a sample of WHO data (who_df
) with
obesity percentages and life expectancy data per country, year, and sex.
You want to visually inspect the correlation between obesity and life
expectancy.
However, the data is very messy with four variables hidden in the column
names. Each column name is made up of three parts separated by
underscores: Values for the year
, followed by those for sex
, and then values for either pct.obese
or life.exp
. Since the third part of the column name string holds two variables you’ll need to use the special “.value”
value in the names_to
argument.
You’ll pivot the data into a tidy format and create the scatterplot.
The ggplot2
package has been pre-loaded for you.
-
Inspect
who_df
in the console. -
Pivot the data so that each variable (
year
,sex
,pct.obese
,life.exp
) has a column of the correct data type.
#who_df <- read.csv("https://raw.githubusercontent.com/Vishal0229/Data605/master/Week12/who.csv", header =T)
# who_df %>%
# # Put each variable in its own column
# pivot_longer(
# -country,
# names_to = c("year", "sex", ".value"),
# names_sep = "_",
# names_transform = list("year" = as.integer)
# )
-
Use
ggplot()
to create a scatterplot withlife.exp
overpct.obese
. Color the points bysex
.
# who_df %>%
# # Put each variable in its own column
# pivot_longer(
# -country,
# names_to = c("year", "sex", ".value"),
# names_sep = "_",
# names_transform = list("year" = as.integer)
# ) %>%
# # Create a plot with life expectancy over obesity
# ggplot(aes(x = pct.obese, y = life.exp, color = sex))+
# geom_point()
Well done, you’ve transformed an extremely messy dataset with a single call of the pivot_longer()
function! Perhaps surprisingly, countries with low life expectancy
often have low obesity values even though we expected to see a negative
trend. This is probably due to non-obesity related health issues being
more important in these countries.
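Since the who_df file isn't available here (which is why the code above is commented out), the toy tibble below mimics its column layout and demonstrates the ".value" trick in a self-contained way; all country names and numbers are made up:

```r
library(tidyr)

# Column names hold year, sex, and the variable name, separated by "_"
toy_df <- tibble::tibble(
  country = c("X", "Y"),
  `1975_male_pct.obese`   = c(5, 6),
  `1975_male_life.exp`    = c(60, 65),
  `1975_female_pct.obese` = c(7, 8),
  `1975_female_life.exp`  = c(64, 70)
)

toy_df %>%
  pivot_longer(
    -country,
    # ".value" keeps the third name part (pct.obese / life.exp) as column names
    names_to = c("year", "sex", ".value"),
    names_sep = "_",
    names_transform = list(year = as.integer)
  )
```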
2.3.3 Uncounting observations
You’ve found the job of your dreams providing technical support for a
dog breed beauty contest. The jury members want a spreadsheet with the
breed and id of each participating dog so that they can add the scores
later on. You’ve only been given the number of participants per dog
breed (dog_df
) so you decide to use your tidyr
skills to create the desired result.
- Inspect the data in the console.
-
Uncount the data so that per breed, each dog gets a row and an ID. The ID should go in the
dog_id
column.
dog_df = tribble(~breed, ~n_participants,
                 "Husky", 5,
                 "Golden retriever", 12,
                 "Poodle", 13,
                 "Shiba Inu", 7)
dog_df %>%
  # Create one row for each participant and add the id
  uncount(n_participants, .id = "dog_id")
## # A tibble: 37 × 2
## breed dog_id
## <chr> <int>
## 1 Husky 1
## 2 Husky 2
## 3 Husky 3
## 4 Husky 4
## 5 Husky 5
## 6 Golden retriever 1
## 7 Golden retriever 2
## 8 Golden retriever 3
## 9 Golden retriever 4
## 10 Golden retriever 5
## # … with 27 more rows
May the prettiest dog win! Uncounting data isn’t something you’ll do every day, but in specific cases it’s just what you need.
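A handy sanity check: uncount() is the inverse of dplyr's count(), so counting the expanded rows recovers the original tallies. A sketch on a two-breed subset:

```r
library(tidyr)
library(dplyr)

dog_df <- tribble(~breed, ~n_participants,
                  "Husky", 5,
                  "Poodle", 13)

# Expand to one row per dog, then count back per breed
dog_df %>%
  uncount(n_participants) %>%
  count(breed, name = "n_participants")
```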
2.4 From long to wide data
2.4.1 Soviet space dogs, the flight perspective
Remember the USSR space dogs dataset¹? You changed it to a long format so that for every dog in every rocket launch, there was a row. Suppose you’re given this tidy dataset and are asked to answer the question, “In what percentage of flights were both dogs of the same gender?”
You’ll reshape and investigate space_dogs_df
to find the answer.
The dplyr
package has been pre-loaded for you.
1 Compiled by Duncan Geere.
-
Pivot the data to a wider format, deriving new column names from the
dog_id
column and values from thegender
column.
space_dogs_df1 = space_dogs_df %>%
  pivot_longer(
    # Add the columns to pivot
    name_1:gender_2,
    names_sep = "_",
    # Complete the names_to argument to re-use the first part of the column headers
    names_to = c(".value", "dog_id"),
    # Make sure NA values are dropped
    values_drop_na = TRUE
  )
space_dogs_df1 %>%
  # Pivot the data to a wider format
  pivot_wider(
    names_from = dog_id,
    values_from = gender,
    names_prefix = "gender_"
  )
## # A tibble: 84 × 5
## date result name gender_1 gender_2
## <chr> <chr> <chr> <chr> <chr>
## 1 1966-02-22 recovered safely after a 22-day flight "Ugo… Male <NA>
## 2 1966-02-22 recovered safely after a 22-day flight "Vet… <NA> "Male"
## 3 1961-03-25 one orbit, recovered safely "Zve… Female <NA>
## 4 1961-03-25 one orbit, recovered safely "" <NA> ""
## 5 1961-03-09 one orbit, recovered safely "Che… Female <NA>
## 6 1961-03-09 one orbit, recovered safely "" <NA> ""
## 7 1960-12-22 upper stage failed, dogs recovered after … "Shu… Female <NA>
## 8 1960-12-22 upper stage failed, dogs recovered after … "Kom… <NA> "Female"
## 9 1960-12-01 spent one day in orbit, capsule was destr… "Mus… Female <NA>
## 10 1960-12-01 spent one day in orbit, capsule was destr… "Pch… <NA> "Female"
## # … with 74 more rows
-
Drop rows that contain
NA
values.
space_dogs_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrv7ECpC0Jwv5u4Y5KZFZOGJ2SnEruSs7xoGs_ZPLbLS5CI32JVRLAxpv0dyneZovOOQ_jyaqcZeoh/pub?gid=1588863802&single=true&output=csv")
space_dogs_df %>%
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>%
  # Drop rows with NA values
  drop_na()
## # A tibble: 39 × 3
## date gender_1 gender_2
## <int> <chr> <chr>
## 1 18831 Male Male
## 2 18838 Male Female
## 3 18855 Male Male
## 4 18859 Male Male
## 5 18868 Male Male
## 6 18874 Male Male
## 7 18805 Female Male
## 8 19907 Female Male
## 9 19912 Female Male
## 10 19931 Female Male
## # … with 29 more rows
-
Create a new column
same_gender
, which has aTRUE
value whengender_1
equalsgender_2
.
space_dogs_df %>%
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>%
  # Drop rows with NA values
  drop_na() %>%
  # Create a Boolean column on whether both dogs have the same gender
  mutate(same_gender = gender_1 == gender_2) %>%
  summarize(pct_same_gender = mean(same_gender))
## # A tibble: 1 × 1
## pct_same_gender
## <dbl>
## 1 0.795
Good job! The answer to the question is 80%. By switching back to wide
format, it was easy to remove flights with just one dog and calculate
per-flight properties. If you’ve been using tidyr since before version 1.0, you may remember that pivot_wider() was then named spread().
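For reference, here are the retired and modern spellings side by side on a toy long-format tibble (hypothetical values):

```r
library(tidyr)

df <- tibble::tibble(
  planet = c("Earth", "Earth"),
  metric = c("diameter", "temperature"),
  value  = c(12756, 15)
)

# Retired syntax, kept for backward compatibility
spread(df, key = metric, value = value)

# Modern equivalent
pivot_wider(df, names_from = metric, values_from = value)
```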
2.4.2 Planet temperature & distance to the Sun
The intensity of light radiated by a light source follows an inverse square relationship with the distance it has traveled.
You wonder if you could observe this trend in the temperature of the
planets in our Solar System given their distance to the Sun. You’ll use
the
planet_df
dataset from the devstronomy project to investigate this.
-
Inspect
planet_df
in the console. -
Use the
pivot_wider()
function to extract column names from themetric
column and values from thevalue
column.
planet_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/planet_metrics_long.rds")
planet_df %>%
  # Give each planet variable its own column
  pivot_wider(names_from = "metric", values_from = "value")
## # A tibble: 8 × 4
## planet diameter distance_to_sun temperature
## <chr> <dbl> <dbl> <dbl>
## 1 Mercury 4879 57.9 167
## 2 Venus 12104 108. 464
## 3 Earth 12756 150. 15
## 4 Mars 6792 228. -65
## 5 Jupiter 142984 779. -110
## 6 Saturn 120536 1434. -140
## 7 Uranus 51118 2872. -195
## 8 Neptune 49528 4495. -200
-
Use the
ggplot()
function to create a plot with thetemperature
over thedistance_to_sun
.
planet_df %>%
  # Give each planet variable its own column
  pivot_wider(names_from = "metric", values_from = "value") %>%
  # Plot planet temperature over distance to sun
  ggplot(aes(x = distance_to_sun, y = temperature)) +
  geom_point(aes(size = diameter)) +
  geom_text(aes(label = planet), vjust = -1) +
  labs(x = "Distance to sun (million km)",
       y = "Mean temperature (°C)") +
  theme(legend.position = "none")
Stellar! The inverse square relation is indeed present. Only Venus doesn’t follow the trend: it is farther from the Sun than Mercury but also hotter. It turns out it has a lot of CO2 in its atmosphere trapping the heat. Let’s not become like Venus ;)
2.4.3 Transposing planet data
You’re again working on a planet dataset derived from the devstronomy project. This time, you’re interested in the correlation between the diameter of a planet and the number of moons circling it.
However, the dataset (planet_df
) has a row for each
variable and a column for each planet (observation). You’ll transpose
this data in two steps and then create a plot to inspect the
correlation.
The ggplot2
package has been pre-loaded for you.
-
Inspect
planet_df
in the console. -
Pivot the data so that planet names are put in a column named
planet
.
planet_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/planet_metrics_wide.rds")

planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet")
## # A tibble: 40 × 3
## metric planet value
## <chr> <chr> <dbl>
## 1 mass Mercury 0.33
## 2 mass Venus 4.87
## 3 mass Earth 5.97
## 4 mass Mars 0.642
## 5 mass Jupiter 1898
## 6 mass Saturn 568
## 7 mass Uranus 86.8
## 8 mass Neptune 102
## 9 diameter Mercury 4879
## 10 diameter Venus 12104
## # … with 30 more rows
-
Pivot the data so that each variable in the
metric
column gets its own column.
planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet") %>%
  # Put each metric in its own column
  pivot_wider(names_from = metric, values_from = value)
## # A tibble: 8 × 6
## planet mass diameter distance_from_sun gravity number_of_moons
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mercury 0.33 4879 57.9 3.7 0
## 2 Venus 4.87 12104 108. 8.9 0
## 3 Earth 5.97 12756 150. 9.8 1
## 4 Mars 0.642 6792 228. 3.7 2
## 5 Jupiter 1898 142984 779. 23.1 79
## 6 Saturn 568 120536 1434. 9 62
## 7 Uranus 86.8 51118 2872. 8.7 27
## 8 Neptune 102 49528 4495. 11 14
-
Use the
ggplot()
function to create a plot with thenumber_of_moons
overdiameter
.
planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet") %>%
  # Put each metric in its own column
  pivot_wider(names_from = metric, values_from = value) %>%
  # Plot the number of moons vs planet diameter
  ggplot(aes(x = diameter, y = number_of_moons)) +
  geom_point(aes(size = diameter)) +
  geom_text(aes(label = planet), vjust = -1) +
  labs(x = "Diameter (km)", y = "Number of moons") +
  theme(legend.position = "none")
Well done! It’s clear that bigger planets tend to have more moons. Once you’re able to split the messiness of a dataset into smaller, easier to fix problems, you’ve become a data wrangling master.
3 Expanding data
Values can often be missing in your data, and sometimes entire observations are absent too. In this chapter, you’ll learn how to complete your dataset with these missing observations. You’ll add observations with zero values to counted data, expand time series to a full sequence of intervals, and more!
3.1 Creating unique combinations of vectors
3.1.1 Letters of the genetic code
The basic building blocks of RNA are four molecules described by a single letter each: adenine (A), cytosine (C), guanine (G), and uracil (U). The information carried by an RNA strand can be represented as a long sequence of these four letters. To read this code, one has to divide this chain into sequences of three letters each (e.g. GCU, ACG, …). These three letter sequences are known as codons. The concept is illustrated in the image below.
Your goal for this exercise is to create a data frame with all possible three letter sequences (codons) from a vector with the four letters representing the RNA building blocks.
-
Create a tibble with three columns called
letter1
,letter2
, andletter3
that holds all possible combinations of the vectorletters
usingexpand_grid()
.
letters <- c("A", "C", "G", "U")

# Create a tibble with all possible 3 way combinations
codon_df <- expand_grid(
  letter1 = letters,
  letter2 = letters,
  letter3 = letters
)
codon_df
## # A tibble: 64 × 3
## letter1 letter2 letter3
## <chr> <chr> <chr>
## 1 A A A
## 2 A A C
## 3 A A G
## 4 A A U
## 5 A C A
## 6 A C C
## 7 A C G
## 8 A C U
## 9 A G A
## 10 A G C
## # … with 54 more rows
-
Use the
unite()
function from chapter one to merge these three columns into a single column namedcodon
. Use an empty string as the separator.
letters <- c("A", "C", "G", "U")

# Create a tibble with all possible 3 way combinations
codon_df <- expand_grid(
  letter1 = letters,
  letter2 = letters,
  letter3 = letters
)

codon_df %>%
  # Unite these three columns into a "codon" column
  unite("codon", letter1:letter3, sep = "")
## # A tibble: 64 × 1
## codon
## <chr>
## 1 AAA
## 2 AAC
## 3 AAG
## 4 AAU
## 5 ACA
## 6 ACC
## 7 ACG
## 8 ACU
## 9 AGA
## 10 AGC
## # … with 54 more rows
Well done! There are 64 possible codons (64 = 4³). They encode 20 amino acids, the building blocks of proteins and, thus, you!
3.1.2 When did humans replace dogs in space?
You already know that in the early days of spaceflight, the USSR was
testing rockets with dogs. You now wonder when exactly humans started
replacing dogs on space flight missions. You’ve been given a dataset space_df
with the number of both dogs (compiled by Duncan Geere) and humans in space per year from 1951 till 1970 (collected from Wikipedia).
Your goal is to create a plot that shows you the number of individuals
sent into space per species. Before you can create this plot, you’ll
first have to introduce zero values for missing combinations of year
and species
.
The dplyr
and ggplot2
packages have been pre-loaded for you.
-
Create
full_df
, a tibble with all unique combinations of the variablesyear
(from 1951 to 1970) andspecies
(“Human”
and“Dog”
).
# Create a tibble with all combinations of years and species
full_df <- expand_grid(
  year = 1951:1970,
  species = c("Human", "Dog")
)
full_df
## # A tibble: 40 × 2
## year species
## <int> <chr>
## 1 1951 Human
## 2 1951 Dog
## 3 1952 Human
## 4 1952 Dog
## 5 1953 Human
## 6 1953 Dog
## 7 1954 Human
## 8 1954 Dog
## 9 1955 Human
## 10 1955 Dog
## # … with 30 more rows
-
Perform a
right_join()
betweenspace_df
andfull_df
on theyear
andspecies
columns.
space_df = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrv7ECpC0Jwv5u4Y5KZFZOGJ2SnEruSs7xoGs_ZPLbLS5CI32JVRLAxpv0dyneZovOOQ_jyaqcZeoh/pub?gid=1478550685&single=true&output=csv")
space_df %>%
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>%
  arrange(year)
## year species n_in_space
## 1 1951 Dog 14
## 2 1951 Human NA
## 3 1952 Human NA
## 4 1952 Dog NA
## 5 1953 Human NA
## 6 1953 Dog NA
## 7 1954 Dog 6
## 8 1954 Human NA
## 9 1955 Dog 6
## 10 1955 Human NA
## 11 1956 Dog 8
## 12 1956 Human NA
## 13 1957 Dog 11
## 14 1957 Human NA
## 15 1958 Dog 12
## 16 1958 Human NA
## 17 1959 Dog 4
## 18 1959 Human NA
## 19 1960 Dog 16
## 20 1960 Human NA
## 21 1961 Dog 2
## 22 1961 Human 4
## 23 1962 Human 5
## 24 1962 Dog NA
## 25 1963 Human 5
## 26 1963 Dog NA
## 27 1964 Human 3
## 28 1964 Dog NA
## 29 1965 Human 12
## 30 1965 Dog NA
## 31 1966 Dog 2
## 32 1966 Human 10
## 33 1967 Human 1
## 34 1967 Dog NA
## 35 1968 Human 7
## 36 1968 Dog NA
## 37 1969 Human 23
## 38 1969 Dog NA
## 39 1970 Human 5
## 40 1970 Dog NA
-
Use the
ggplot()
function to create a line plot ofn_in_space
overyear
, colored byspecies
.
space_df %>%
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>%
  # Create a line plot with n_in_space over year, color by species
  ggplot(aes(x = year, y = n_in_space, color = species)) +
  geom_line()
## Warning: Removed 14 row(s) containing missing values (geom_path).
-
Use the
replace_na()
function to overwriteNA
values in then_in_space
column with zeros.
space_df %>%
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>%
  # Overwrite NA values for n_in_space with 0L
  replace_na(list(n_in_space = 0L)) %>%
  # Create a line plot with n_in_space over year, color by species
  ggplot(aes(x = year, y = n_in_space, color = species)) +
  geom_line()
Good job! It looks like 1961 was the year humans replaced dogs in space. Adding zero observations allowed you to create a correct and complete plot.
3.1.3 Finding missing observations
You’re an inspector at a nuclear plant and have to validate whether
every reactor has received its daily safety check over the course of a
full year. The safety check logs are in reactor_df
, a data frame with columns date
, reactor
, and check
.
Two vectors, dates
and reactors
, with all dates of the year and reactors at the plant respectively have been created for you.
You’ll use the combination of the expand_grid()
and anti_join()
functions to find dates where particular reactors were not checked.
The dplyr
package has been pre-loaded for you.
- Use the expand_grid() function to create a tibble holding all combinations of the variables date and reactor. Use the dates and reactors vectors created for you.
dates = seq(as.Date("1986-01-01"), as.Date("1986-12-31"), "days")
reactors = c("A", "B", "C", "D")
# Create a tibble with all combinations of dates and reactors
full_df <- expand_grid(date = dates, reactor = reactors)
- Perform an anti_join() between full_df and reactor_df on the date and reactor columns.
# Find the reactor - date combinations not present in reactor_df
#full_df %>%
# anti_join(reactor_df, by = c("date", "reactor"))
Well done, looks like safety was not optimal in April 1986. Sometimes the observations not in the data are more important than the ones present. The trick you just learned can help you find them.
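The pattern generalizes well beyond this exercise; here is a tiny self-contained sketch with made-up log data standing in for reactor_df:

```r
library(tidyr)
library(dplyr)

# Hypothetical check log with one inspection missing
log_df <- tibble(
  date    = as.Date(c("1986-04-25", "1986-04-25", "1986-04-26")),
  reactor = c("A", "B", "A")
)

# All date - reactor combinations that should exist
full_df <- expand_grid(
  date    = as.Date(c("1986-04-25", "1986-04-26")),
  reactor = c("A", "B")
)

# Combinations in full_df with no match in log_df: the missed checks
full_df %>%
  anti_join(log_df, by = c("date", "reactor"))
# -> one row: 1986-04-26, reactor B
```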
3.2 Completing data with all value combinations
3.2.1 Completing the Solar System
You have been given a data frame (planet_df
) from the devstronomy project
with the number of moons per planet in our Solar System. However,
Mercury and Venus, the two moonless planets, are absent. You want to
expand this dataset using the complete()
function and a vector planets
that contains all eight planets’ names.
- Complete the planet variable using the planets vector.
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
- Fill NA values in the n_moons variable with 0L values.
planet_df <- tribble(~planet, ~n_moons,
                     "Earth", 1,
                     "Mars", 2,
                     "Jupiter", 79,
                     "Saturn", 62,
                     "Uranus", 27,
                     "Neptune", 14)
planet_df %>%
  complete(
    # Complete the planet variable
    planet = planets,
    # Overwrite NA values for n_moons with 0L
    fill = list(n_moons = 0L)
  )
## # A tibble: 8 × 2
## planet n_moons
## <chr> <dbl>
## 1 Earth 1
## 2 Jupiter 79
## 3 Mars 2
## 4 Mercury 0
## 5 Neptune 14
## 6 Saturn 62
## 7 Uranus 27
## 8 Venus 0
Well done! By completing with a vector of all known planets, you can be sure that each will have an observation in the output. Did you know Jupiter is circled by at least 79 moons? This number is likely to increase as new ones continue to be discovered every few years.
3.2.2 Zero Olympic medals
Since 1896, athletes from all over the world have been competing in the modern Olympic games. You’ve been given a dataset (medal_df
)
with observations for all medals won by athletes from the 10 most
successful countries in Olympic history. You want to create a visual
with the number of medals won per country (team
) per year
.
However, since not all countries won medals each year, you’ll have to
introduce zero values before you can make an accurate visual.
The ggplot2
and dplyr
packages have been pre-loaded for you. In steps 2 and 3, the scale_color_brewer()
function is used to color lines in the plot with a palette that makes it easier to distinguish the different countries.
- Inspect medal_df in the console.
- Count the number of medals won per team and year.
medal_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/olympic_medals_top_10.rds")
medal_df %>%
  # Count the medals won per team and year
  count(team, year, name = "n_medals")
## # A tibble: 231 × 3
## team year n_medals
## <chr> <dbl> <int>
## 1 Australia 1896 2
## 2 Australia 1900 5
## 3 Australia 1904 4
## 4 Australia 1906 3
## 5 Australia 1920 7
## 6 Australia 1924 10
## 7 Australia 1928 4
## 8 Australia 1932 5
## 9 Australia 1936 1
## 10 Australia 1948 16
## # … with 221 more rows
- Use ggplot() to create a line plot with n_medals over year, colored by team.
medal_df %>%
  # Count the medals won per team and year
  count(team, year, name = "n_medals") %>%
  # Plot n_medals over year, colored by team
  ggplot(aes(x = year, y = n_medals, color = team)) +
  geom_line() +
  scale_color_brewer(palette = "Paired")
- Complete the team and year variables, and replace NA values in the n_medals column with zeros.
medal_df %>%
  # Count the medals won per team and year
  count(team, year, name = "n_medals") %>%
  # Complete the team and year variables, fill n_medals with zeros
  complete(team, year, fill = list(n_medals = 0)) %>%
  # Plot n_medals over year, colored by team
  ggplot(aes(x = year, y = n_medals, color = team)) +
  geom_line() +
  scale_color_brewer(palette = "Paired")
Well done! Zero values are often absent from counted data, which can lead to misleading visualizations. We can now see the USA boycott of the 1980 Olympics in Moscow and the subsequent USSR boycott of the 1984 Olympics in Los Angeles.
3.2.3 Creating a sequence with full_seq()
The full_seq()
function looks for the minimum and maximum values in the vector you pass it, then generates the full sequence of numbers with a fixed period between them.
When used inside the complete()
function, full_seq()
is a handy tool to make sure there are no missing observations in your
data. Before combining these two functions you’ll generate a few
sequences with full_seq()
on its own to get the hang of this function.
- Use full_seq() to create a sequence with all years from 2020 to 2030.
# Generate all years from 2020 to 2030
years <- full_seq(c(2020, 2030), period = 1)
years
## [1] 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030
- Use full_seq() to create a sequence with all decades from 1980 to 2030.
# Generate all decades from 1980 to 2030
decades <- full_seq(c(1980, 2030), period = 10)
decades
## [1] 1980 1990 2000 2010 2020 2030
- Use full_seq() to create a sequence with all dates in 1980 using the outer_dates vector.
outer_dates <- c(as.Date("1980-01-01"), as.Date("1980-12-31"))
# Generate the dates for all days in 1980
full_seq(outer_dates, period = 1)
## [1] "1980-01-01" "1980-01-02" "1980-01-03" "1980-01-04" "1980-01-05"
## [6] "1980-01-06" "1980-01-07" "1980-01-08" "1980-01-09" "1980-01-10"
## [11] "1980-01-11" "1980-01-12" "1980-01-13" "1980-01-14" "1980-01-15"
## [16] "1980-01-16" "1980-01-17" "1980-01-18" "1980-01-19" "1980-01-20"
## [21] "1980-01-21" "1980-01-22" "1980-01-23" "1980-01-24" "1980-01-25"
## [26] "1980-01-26" "1980-01-27" "1980-01-28" "1980-01-29" "1980-01-30"
## [31] "1980-01-31" "1980-02-01" "1980-02-02" "1980-02-03" "1980-02-04"
## [36] "1980-02-05" "1980-02-06" "1980-02-07" "1980-02-08" "1980-02-09"
## [41] "1980-02-10" "1980-02-11" "1980-02-12" "1980-02-13" "1980-02-14"
## [46] "1980-02-15" "1980-02-16" "1980-02-17" "1980-02-18" "1980-02-19"
## [51] "1980-02-20" "1980-02-21" "1980-02-22" "1980-02-23" "1980-02-24"
## [56] "1980-02-25" "1980-02-26" "1980-02-27" "1980-02-28" "1980-02-29"
## [61] "1980-03-01" "1980-03-02" "1980-03-03" "1980-03-04" "1980-03-05"
## [66] "1980-03-06" "1980-03-07" "1980-03-08" "1980-03-09" "1980-03-10"
## [71] "1980-03-11" "1980-03-12" "1980-03-13" "1980-03-14" "1980-03-15"
## [76] "1980-03-16" "1980-03-17" "1980-03-18" "1980-03-19" "1980-03-20"
## [81] "1980-03-21" "1980-03-22" "1980-03-23" "1980-03-24" "1980-03-25"
## [86] "1980-03-26" "1980-03-27" "1980-03-28" "1980-03-29" "1980-03-30"
## [91] "1980-03-31" "1980-04-01" "1980-04-02" "1980-04-03" "1980-04-04"
## [96] "1980-04-05" "1980-04-06" "1980-04-07" "1980-04-08" "1980-04-09"
## [101] "1980-04-10" "1980-04-11" "1980-04-12" "1980-04-13" "1980-04-14"
## [106] "1980-04-15" "1980-04-16" "1980-04-17" "1980-04-18" "1980-04-19"
## [111] "1980-04-20" "1980-04-21" "1980-04-22" "1980-04-23" "1980-04-24"
## [116] "1980-04-25" "1980-04-26" "1980-04-27" "1980-04-28" "1980-04-29"
## [121] "1980-04-30" "1980-05-01" "1980-05-02" "1980-05-03" "1980-05-04"
## [126] "1980-05-05" "1980-05-06" "1980-05-07" "1980-05-08" "1980-05-09"
## [131] "1980-05-10" "1980-05-11" "1980-05-12" "1980-05-13" "1980-05-14"
## [136] "1980-05-15" "1980-05-16" "1980-05-17" "1980-05-18" "1980-05-19"
## [141] "1980-05-20" "1980-05-21" "1980-05-22" "1980-05-23" "1980-05-24"
## [146] "1980-05-25" "1980-05-26" "1980-05-27" "1980-05-28" "1980-05-29"
## [151] "1980-05-30" "1980-05-31" "1980-06-01" "1980-06-02" "1980-06-03"
## [156] "1980-06-04" "1980-06-05" "1980-06-06" "1980-06-07" "1980-06-08"
## [161] "1980-06-09" "1980-06-10" "1980-06-11" "1980-06-12" "1980-06-13"
## [166] "1980-06-14" "1980-06-15" "1980-06-16" "1980-06-17" "1980-06-18"
## [171] "1980-06-19" "1980-06-20" "1980-06-21" "1980-06-22" "1980-06-23"
## [176] "1980-06-24" "1980-06-25" "1980-06-26" "1980-06-27" "1980-06-28"
## [181] "1980-06-29" "1980-06-30" "1980-07-01" "1980-07-02" "1980-07-03"
## [186] "1980-07-04" "1980-07-05" "1980-07-06" "1980-07-07" "1980-07-08"
## [191] "1980-07-09" "1980-07-10" "1980-07-11" "1980-07-12" "1980-07-13"
## [196] "1980-07-14" "1980-07-15" "1980-07-16" "1980-07-17" "1980-07-18"
## [201] "1980-07-19" "1980-07-20" "1980-07-21" "1980-07-22" "1980-07-23"
## [206] "1980-07-24" "1980-07-25" "1980-07-26" "1980-07-27" "1980-07-28"
## [211] "1980-07-29" "1980-07-30" "1980-07-31" "1980-08-01" "1980-08-02"
## [216] "1980-08-03" "1980-08-04" "1980-08-05" "1980-08-06" "1980-08-07"
## [221] "1980-08-08" "1980-08-09" "1980-08-10" "1980-08-11" "1980-08-12"
## [226] "1980-08-13" "1980-08-14" "1980-08-15" "1980-08-16" "1980-08-17"
## [231] "1980-08-18" "1980-08-19" "1980-08-20" "1980-08-21" "1980-08-22"
## [236] "1980-08-23" "1980-08-24" "1980-08-25" "1980-08-26" "1980-08-27"
## [241] "1980-08-28" "1980-08-29" "1980-08-30" "1980-08-31" "1980-09-01"
## [246] "1980-09-02" "1980-09-03" "1980-09-04" "1980-09-05" "1980-09-06"
## [251] "1980-09-07" "1980-09-08" "1980-09-09" "1980-09-10" "1980-09-11"
## [256] "1980-09-12" "1980-09-13" "1980-09-14" "1980-09-15" "1980-09-16"
## [261] "1980-09-17" "1980-09-18" "1980-09-19" "1980-09-20" "1980-09-21"
## [266] "1980-09-22" "1980-09-23" "1980-09-24" "1980-09-25" "1980-09-26"
## [271] "1980-09-27" "1980-09-28" "1980-09-29" "1980-09-30" "1980-10-01"
## [276] "1980-10-02" "1980-10-03" "1980-10-04" "1980-10-05" "1980-10-06"
## [281] "1980-10-07" "1980-10-08" "1980-10-09" "1980-10-10" "1980-10-11"
## [286] "1980-10-12" "1980-10-13" "1980-10-14" "1980-10-15" "1980-10-16"
## [291] "1980-10-17" "1980-10-18" "1980-10-19" "1980-10-20" "1980-10-21"
## [296] "1980-10-22" "1980-10-23" "1980-10-24" "1980-10-25" "1980-10-26"
## [301] "1980-10-27" "1980-10-28" "1980-10-29" "1980-10-30" "1980-10-31"
## [306] "1980-11-01" "1980-11-02" "1980-11-03" "1980-11-04" "1980-11-05"
## [311] "1980-11-06" "1980-11-07" "1980-11-08" "1980-11-09" "1980-11-10"
## [316] "1980-11-11" "1980-11-12" "1980-11-13" "1980-11-14" "1980-11-15"
## [321] "1980-11-16" "1980-11-17" "1980-11-18" "1980-11-19" "1980-11-20"
## [326] "1980-11-21" "1980-11-22" "1980-11-23" "1980-11-24" "1980-11-25"
## [331] "1980-11-26" "1980-11-27" "1980-11-28" "1980-11-29" "1980-11-30"
## [336] "1980-12-01" "1980-12-02" "1980-12-03" "1980-12-04" "1980-12-05"
## [341] "1980-12-06" "1980-12-07" "1980-12-08" "1980-12-09" "1980-12-10"
## [346] "1980-12-11" "1980-12-12" "1980-12-13" "1980-12-14" "1980-12-15"
## [351] "1980-12-16" "1980-12-17" "1980-12-18" "1980-12-19" "1980-12-20"
## [356] "1980-12-21" "1980-12-22" "1980-12-23" "1980-12-24" "1980-12-25"
## [361] "1980-12-26" "1980-12-27" "1980-12-28" "1980-12-29" "1980-12-30"
## [366] "1980-12-31"
Well done! Now that you know how the full_seq()
function works, you can start plugging it into the complete()
function.
3.2.4 The Cold War’s hottest year
In October 1962, during the Cuban missile crisis,
the world came close to full-scale nuclear war. Throughout 1962, the
USA, USSR, and France together detonated a record 178 nuclear bombs for
military power display and research. You’ve been given a sample of the
Nuclear Explosion Database (NEDB) for that year (cumul_nukes_1962_df
) with an observation for each date
on which a bomb was detonated. The total_bombs
variable contains the cumulative number of bombs detonated by a country up to that point in time.
You’ll complete the dataset to hold the full sequence of dates, and
visualize the total number of bombs per country over time. You’ll also
use the fill()
function from Chapter One to impute missing values.
The dplyr
and ggplot2
packages have been pre-loaded for you.
- Complete the dataset so that for each country there is an observation for each date, using the full_seq() function.
cumul_nukes_1962_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/nukes_1962.rds")
cumul_nukes_1962_df %>%
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1))
## # A tibble: 1,077 × 3
## country date total_bombs
## <chr> <date> <int>
## 1 France 1962-01-01 0
## 2 France 1962-01-02 NA
## 3 France 1962-01-03 NA
## 4 France 1962-01-04 NA
## 5 France 1962-01-05 NA
## 6 France 1962-01-06 NA
## 7 France 1962-01-07 NA
## 8 France 1962-01-08 NA
## 9 France 1962-01-09 NA
## 10 France 1962-01-10 NA
## # … with 1,067 more rows
- Group the data by country.
- Use the fill() function to overwrite NA values in the total_bombs variable with the last known value.
cumul_nukes_1962_df %>%
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1)) %>%
  # Group the data by country
  group_by(country) %>%
  # Impute missing values with the last known observation
  fill(total_bombs)
## # A tibble: 1,077 × 3
## # Groups: country [3]
## country date total_bombs
## <chr> <date> <int>
## 1 France 1962-01-01 0
## 2 France 1962-01-02 0
## 3 France 1962-01-03 0
## 4 France 1962-01-04 0
## 5 France 1962-01-05 0
## 6 France 1962-01-06 0
## 7 France 1962-01-07 0
## 8 France 1962-01-08 0
## 9 France 1962-01-09 0
## 10 France 1962-01-10 0
## # … with 1,067 more rows
- Use ggplot() to visualize total_bombs at any given date, and color the line plot by country. Some code has been added for you to visualize the Cuban Missile Crisis.
cumul_nukes_1962_df %>%
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1)) %>%
  # Group the data by country
  group_by(country) %>%
  # Impute missing values with the last known observation
  fill(total_bombs) %>%
  # Plot the number of bombs over time, color by country
  ggplot(aes(date, total_bombs, color = country)) +
  # These two lines will mark the Cuban Missile Crisis
  geom_rect(xmin = as.Date("1962-10-16"), xmax = as.Date("1962-10-29"), ymin = -Inf, ymax = Inf, color = NA) +
  geom_text(x = as.Date("1962-10-22"), y = 15, label = "Cuban Missile Crisis", angle = 90, color = "white") +
  geom_line()
Good job! It looks like the USSR only started running tests in the second half of the year. The full_seq() function allowed you to add dates not yet seen in the data.
3.3 Advanced completions
3.3.1 Olympic medals per continent
You want to compare Olympic performance of athletes per continent over
time, both on the winter and summer Olympics. You’ve been given a
dataset medal_df
with the average number of medals won per
participant of each continent since 1928. You’ll complete this data to
introduce zero values for years where a continent did not win any
medals.
The ggplot2
package has been pre-loaded for you.
- Complete the dataset so that each continent has a medals_per_participant value at each Olympic event. Missing values should be filled with zeros.
- Nest the season and year variables using the nesting() function, since the summer and winter Olympics don’t occur in the same years.
- Use ggplot() to create a line plot with medals_per_participant per year, and color the plot by continent.
medal_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ7ryL3YFNP4S9bw-F31Ejavx2ZJIBM5yWWVclfVSONfXMeTzaJaIe8yeCkma0FJuudkTUbIbRyWlDz/pub?gid=1081899654&single=true&output=csv")
medal_df %>%
  # Give each continent an observation at each Olympic event
  complete(
    continent, nesting(season, year),
    fill = list(medals_per_participant = 0)
  ) %>%
  # Plot the medals_per_participant over time, colored by continent
  ggplot(aes(x = year, y = medals_per_participant, color = continent)) +
  geom_line() +
  facet_grid(season ~ .)
Well done! Since the nineties, North American athletes have been most proficient, especially in the winter Olympics. Nesting variables while completing is important to avoid adding nonsensical observations to your data, like summer Olympics in years where they did not happen.
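The effect of nesting() is easiest to see on a toy tibble (made-up data, not the course's medal_df): crossing the completed variables invents season-year pairs that never happened, while nesting() keeps only the pairs observed in the data.

```r
library(tidyr)

games <- tibble::tibble(
  season = c("Summer", "Winter"),
  year   = c(2016, 2018),
  n      = c(10, 20)
)

# Crossing season and year invents "Summer 2018" and "Winter 2016": 4 rows
complete(games, season, year)

# nesting() keeps only the season-year pairs seen in the data: 2 rows
complete(games, nesting(season, year))
```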
3.3.2 Tracking a virus outbreak
You’re a doctor in a remote village confronted with a virus outbreak.
You have been collecting data on when your patients got infected and
recovered in a data frame named patient_df
. Your goal is to
create a visual with the number of sick patients over time. You’ll
first have to reshape the data so that you can count the number of sick
patients per day.
The dplyr
and ggplot2
packages have been pre-loaded for you.
- Inspect patient_df in the console.
- Pivot the infected and recovered columns to long format; the old column names should go in the status variable, the values in date.
patient_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ7ryL3YFNP4S9bw-F31Ejavx2ZJIBM5yWWVclfVSONfXMeTzaJaIe8yeCkma0FJuudkTUbIbRyWlDz/pub?gid=1724949310&single=true&output=csv")
patient_df <- patient_df %>% mutate(infected = as.Date(infected), recovered = as.Date(recovered))
patient_df %>%
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date")
## # A tibble: 52 × 3
## patient status date
## <chr> <chr> <date>
## 1 A infected 2020-02-08
## 2 A recovered 2020-03-28
## 3 B infected 2020-02-19
## 4 B recovered 2020-03-04
## 5 C infected 2020-02-27
## 6 C recovered 2020-03-19
## 7 D infected 2020-03-05
## 8 D recovered 2020-03-19
## 9 E infected 2020-03-06
## 10 E recovered 2020-03-29
## # … with 42 more rows
- Group the data by patient, then complete the date column so that each date between infection and recovery is added using the full_seq() function. At the end, ungroup the data.
patient_df %>%
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date") %>%
  select(-status) %>%
  # Group by patient
  group_by(patient) %>%
  # Complete the date range per patient using full_seq()
  complete(date = full_seq(date, period = 1)) %>%
  # Ungroup the data
  ungroup()
## # A tibble: 684 × 2
## patient date
## <chr> <date>
## 1 A 2020-02-08
## 2 A 2020-02-09
## 3 A 2020-02-10
## 4 A 2020-02-11
## 5 A 2020-02-12
## 6 A 2020-02-13
## 7 A 2020-02-14
## 8 A 2020-02-15
## 9 A 2020-02-16
## 10 A 2020-02-17
## # … with 674 more rows
- Each date is now a day on which a patient was sick; count the dates and name the new variable n_sick.
patient_df %>%
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date") %>%
  select(-status) %>%
  # Group by patient
  group_by(patient) %>%
  # Complete the date range per patient using full_seq()
  complete(date = full_seq(date, period = 1)) %>%
  # Ungroup the data
  ungroup() %>%
  # Count the dates, the count goes in the n_sick variable
  count(date, name = "n_sick") %>%
  ggplot(aes(x = date, y = n_sick)) +
  geom_line()
Good job! Combining group_by()
with tidyr
functions further broadens the range of manipulations you can do.
3.3.3 Counting office occupants
Imagine you’re an office facility manager and want to know how many people are present throughout the day. You’ve installed a sensor at the entrance that counts the number of people entering and leaving the building. The sensor sends an update at the end of every 20-minute time slot if at least one person passed.
To create a dataset ready for visualization, you’ll combine the different techniques you’ve learned so far.
The dplyr
and ggplot2
packages have been pre-loaded for you.
- Complete the time variable by using the seq() function to create a sequence between its min and max values with an interval of "20 min". Fill NA values of enter and exit with 0L.
sensor_df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ7ryL3YFNP4S9bw-F31Ejavx2ZJIBM5yWWVclfVSONfXMeTzaJaIe8yeCkma0FJuudkTUbIbRyWlDz/pub?gid=1171603308&single=true&output=csv")
sensor_df <- sensor_df %>% mutate(time = as.POSIXct(time))
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L))
## # A tibble: 36 × 4
## time id enter exit
## <dttm> <int> <int> <int>
## 1 2020-01-01 07:00:00 1 1 0
## 2 2020-01-01 07:20:00 2 2 0
## 3 2020-01-01 07:40:00 3 3 -1
## 4 2020-01-01 08:00:00 4 5 0
## 5 2020-01-01 08:20:00 5 3 0
## 6 2020-01-01 08:40:00 6 4 -1
## 7 2020-01-01 09:00:00 7 4 0
## 8 2020-01-01 09:20:00 8 1 0
## 9 2020-01-01 09:40:00 9 1 -1
## 10 2020-01-01 10:00:00 NA 0 0
## # … with 26 more rows
- Calculate the total_inside variable by taking the cumulative sum of enter plus exit using the cumsum() function.
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit))
## # A tibble: 36 × 5
## time id enter exit total_inside
## <dttm> <int> <int> <int> <int>
## 1 2020-01-01 07:00:00 1 1 0 1
## 2 2020-01-01 07:20:00 2 2 0 3
## 3 2020-01-01 07:40:00 3 3 -1 5
## 4 2020-01-01 08:00:00 4 5 0 10
## 5 2020-01-01 08:20:00 5 3 0 13
## 6 2020-01-01 08:40:00 6 4 -1 16
## 7 2020-01-01 09:00:00 7 4 0 20
## 8 2020-01-01 09:20:00 8 1 0 21
## 9 2020-01-01 09:40:00 9 1 -1 21
## 10 2020-01-01 10:00:00 NA 0 0 21
## # … with 26 more rows
- Pivot the enter and exit columns to long format. The column names should go in the direction variable, the values in n_people.
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit)) %>%
  # Pivot the enter and exit columns to long format
  pivot_longer(enter:exit, names_to = "direction", values_to = "n_people")
## # A tibble: 72 × 5
## time id total_inside direction n_people
## <dttm> <int> <int> <chr> <int>
## 1 2020-01-01 07:00:00 1 1 enter 1
## 2 2020-01-01 07:00:00 1 1 exit 0
## 3 2020-01-01 07:20:00 2 3 enter 2
## 4 2020-01-01 07:20:00 2 3 exit 0
## 5 2020-01-01 07:40:00 3 5 enter 3
## 6 2020-01-01 07:40:00 3 5 exit -1
## 7 2020-01-01 08:00:00 4 10 enter 5
## 8 2020-01-01 08:00:00 4 10 exit 0
## 9 2020-01-01 08:20:00 5 13 enter 3
## 10 2020-01-01 08:20:00 5 13 exit 0
## # … with 62 more rows
- Use ggplot() to visualize n_people in the building over time. Use the fill argument to color the area plot by direction.
sensor_df %>%
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit)) %>%
  # Pivot the enter and exit columns to long format
  pivot_longer(enter:exit, names_to = "direction", values_to = "n_people") %>%
  # Plot the number of people over time, fill by direction
  ggplot(aes(x = time, y = n_people, fill = direction)) +
  geom_area() +
  geom_line(aes(y = total_inside))
Good job! You can see the entries and exits in the area plot and the total number of people inside in the black line. Sensor data often comes at high frequencies, so being able to complete timestamps is a powerful skill.
4 Rectangling data
In the final chapter, you’ll learn how to turn nested data structures such as JSON and XML files into tidy, rectangular data. This skill will enable you to process data from web APIs. You’ll also learn how nested data structures can be used to write elegant modeling pipelines that produce tidy outputs.
4.1 Intro to non-rectangular data
4.1.1 Rectangular vs non-rectangular files
You saw a number of file types in the video. Can you remember which store data in a rectangular format and which don’t?
- Group these file types and formats into the right category.
- CSV stands for Comma Separated Values. It’s often used to save raw spreadsheet data without formatting.
- JSON stands for JavaScript Object Notation. It aims to be both human and machine readable.
- XML stands for Extensible Markup Language. It aims to be both human and machine readable.
- Rectangular data: Spreadsheet, CSV
- Non-rectangular data: JSON, XML
Great! Understanding how these file types are different will help you pick the right tools while tidying.
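To get JSON into R in the first place, a common route is the jsonlite package (an assumption here; this course pre-loads the lists for you). With simplifyVector = FALSE, fromJSON() returns exactly the kind of nested list that tidyr can then rectangle:

```r
# jsonlite is assumed to be installed; it is not part of tidyr
library(jsonlite)

json <- '[{"title": "A New Hope", "episode_id": 4},
          {"title": "The Empire Strikes Back", "episode_id": 5}]'

# Keep the nested list structure instead of auto-simplifying
movie_list <- fromJSON(json, simplifyVector = FALSE)

# One list element per movie, ready for unnest_wider()
tidyr::unnest_wider(tibble::tibble(movie = movie_list), movie)
```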
4.1.2 Rectangling Star Wars movies
Let’s pretend you’re a big Star Wars fan and decided to scrape some data from the Star Wars API. You’ve already loaded the JSON-formatted response into R, and now have two lists of movies named movie_list
and movie_planets_list
.
Your goal is to turn these into rectangular data frames with one row
per movie so that you can start crunching those movie stats.
The dplyr
package has been pre-loaded for you.
- Create a tibble with a single column called movie out of the input movie_list.
movie_list <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/star_wars_movie_list.rds")
# Create a movie column from the movie_list
tibble(movie = movie_list)
## # A tibble: 7 × 1
## movie
## <list>
## 1 <named list [5]>
## 2 <named list [5]>
## 3 <named list [5]>
## 4 <named list [5]>
## 5 <named list [5]>
## 6 <named list [5]>
## 7 <named list [5]>
-
Widen the dataset by unnesting the
movie
column over multiple columns.
# Create a movie column from the movie_list
tibble(movie = movie_list) %>%
# Unnest the movie column
unnest_wider(movie)
## # A tibble: 7 × 5
## title episode_id release_date director producer
## <chr> <int> <chr> <chr> <chr>
## 1 A New Hope 4 1977-05-25 George Lucas Gary Kurtz, …
## 2 Attack of the Clones 2 2002-05-16 George Lucas Rick McCallum
## 3 The Phantom Menace 1 1999-05-19 George Lucas Rick McCallum
## 4 Revenge of the Sith 3 2005-05-19 George Lucas Rick McCallum
## 5 Return of the Jedi 6 1983-05-25 Richard Marquand Howard G. Ka…
## 6 The Empire Strikes Back 5 1980-05-17 Irvin Kershner Gary Kurtz, …
## 7 The Force Awakens 7 2015-12-11 J. J. Abrams Kathleen Ken…
- Re-create the tibble with a single column called movie out of movie_planets_list.
movie_planets_list <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/star_wars_movie_planet_list.rds")
# Create a tibble with a movie column
tibble(movie = movie_planets_list) %>%
  # Unnest the movie column
  unnest_wider(movie)
## # A tibble: 7 × 3
## title release_date planets
## <chr> <chr> <list>
## 1 A New Hope 1977-05-25 <chr [3]>
## 2 The Empire Strikes Back 1980-05-17 <chr [4]>
## 3 Return of the Jedi 1983-05-25 <chr [5]>
## 4 The Phantom Menace 1999-05-19 <chr [3]>
## 5 Attack of the Clones 2002-05-16 <chr [5]>
## 6 Revenge of the Sith 2005-05-19 <chr [13]>
## 7 The Force Awakens 2015-12-11 <chr [1]>
-
Unnest the
planets
column to a wider format.
# Create a tibble with a movie column
tibble(movie = movie_planets_list) %>%
# Unnest the movie column
unnest_wider(movie) %>%
# Unnest the planets column
unnest_wider(planets)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
## New names:
## * `` -> ...1
## # A tibble: 7 × 15
## title release_date ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 A Ne… 1977-05-25 Alde… Yavi… Tato… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 The … 1980-05-17 Hoth Dago… Besp… Ord … <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Retu… 1983-05-25 Dago… Endor Naboo Coru… Tato… <NA> <NA> <NA> <NA> <NA>
## 4 The … 1999-05-19 Naboo Coru… Tato… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Atta… 2002-05-16 Naboo Coru… Kami… Geon… Tato… <NA> <NA> <NA> <NA> <NA>
## 6 Reve… 2005-05-19 Alde… Dago… Naboo Coru… Utap… Must… Kash… Poli… Myge… Felu…
## 7 The … 2015-12-11 Jakku <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # … with 3 more variables: ...11 <chr>, ...12 <chr>, ...13 <chr>
Interesting! You’re able to unnest the full tree structure of the movie_planets_list
, but don’t end up with a tidy format. You’ll learn how to fix this in the next lesson.
4.2 From nested values to observations
4.2.1 Unnesting wide or long
In the slides, you learned some rules of thumb on when to use unnest_longer()
or unnest_wider()
to unnest a list column. You’ll now decide which function to use for two data samples.
- Assign these columns to the function that you would unnest them with.
unnest_wider()
metadata
<list>
<named list [3]>
<named list [3]>
<named list [3]>
<named list [3]>
unnest_longer()
films
<list>
<list [4]>
<list [5]>
<list [4]>
<list [3]>
Well done! These are the two situations where it is clear which unnesting function to try: named lists of fixed length = unnest_wider()
, unnamed lists of varying length = unnest_longer()
. There are other situations, like named lists with varying number of elements, where some trial and error will be necessary.
4.2.2 Rectangling Star Wars planets
Let’s finish what we started in the last exercise of the previous lesson, exploring Star Wars planets! The movie_planets_list
scraped from the Star Wars API has been pre-loaded for you. You’ll need two specific unnesting operations to completely rectangle this data.
- Create a tibble with a single column called movie out of movie_planets_list.
# Create a tibble from movie_planets_list
tibble(movie = movie_planets_list)
## # A tibble: 7 × 1
## movie
## <list>
## 1 <named list [3]>
## 2 <named list [3]>
## 3 <named list [3]>
## 4 <named list [3]>
## 5 <named list [3]>
## 6 <named list [3]>
## 7 <named list [3]>
-
Unnest the
movie
list column which contains named lists of equal length.
# Create a tibble from movie_planets_list
tibble(movie = movie_planets_list) %>%
# Unnest the movie column in the correct direction
unnest_wider(movie)
## # A tibble: 7 × 3
## title release_date planets
## <chr> <chr> <list>
## 1 A New Hope 1977-05-25 <chr [3]>
## 2 The Empire Strikes Back 1980-05-17 <chr [4]>
## 3 Return of the Jedi 1983-05-25 <chr [5]>
## 4 The Phantom Menace 1999-05-19 <chr [3]>
## 5 Attack of the Clones 2002-05-16 <chr [5]>
## 6 Revenge of the Sith 2005-05-19 <chr [13]>
## 7 The Force Awakens 2015-12-11 <chr [1]>
-
Unnest the
planets
list column which contains unnamed lists of unequal length.
# Create a tibble from movie_planets_list
tibble(movie = movie_planets_list) %>%
# Unnest the movie column in the correct direction
unnest_wider(movie) %>%
# Unnest the planets column in the correct direction
unnest_longer(planets)
## # A tibble: 34 × 3
## title release_date planets
## <chr> <chr> <chr>
## 1 A New Hope 1977-05-25 Alderaan
## 2 A New Hope 1977-05-25 Yavin IV
## 3 A New Hope 1977-05-25 Tatooine
## 4 The Empire Strikes Back 1980-05-17 Hoth
## 5 The Empire Strikes Back 1980-05-17 Dagobah
## 6 The Empire Strikes Back 1980-05-17 Bespin
## 7 The Empire Strikes Back 1980-05-17 Ord Mantell
## 8 Return of the Jedi 1983-05-25 Dagobah
## 9 Return of the Jedi 1983-05-25 Endor
## 10 Return of the Jedi 1983-05-25 Naboo
## # … with 24 more rows
Nice work! You now have a tidy dataset with one observation for every planet featured in each movie.
4.2.3 The Solar System’s biggest moons
Most planets in our solar system are accompanied by at least one moon.
You now wonder which planets are circled by the biggest moons and want
to create a top five based on moon radius. However, you’ll first have to
unnest the devstronomy project data in planet_df
using the unnest_longer()
and unnest_wider()
functions.
The dplyr
package has been pre-loaded for you.
-
Unnest the
moons
column so that each moon gets an observation.
planet_df <- readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/planet_moons_list_df.rds")
planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons)
## # A tibble: 174 × 2
## planet moons
## <chr> <list>
## 1 Mercury <NULL>
## 2 Venus <NULL>
## 3 Earth <named list [2]>
## 4 Jupiter <named list [2]>
## 5 Jupiter <named list [2]>
## 6 Jupiter <named list [2]>
## 7 Jupiter <named list [2]>
## 8 Jupiter <named list [2]>
## 9 Jupiter <named list [2]>
## 10 Jupiter <named list [2]>
## # … with 164 more rows
-
Unnest the
moons
column so that its contents are split over columns.
planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons) %>%
  # Further unnest the moons column
  unnest_wider(moons)
## # A tibble: 174 × 3
## planet moon_name moon_data
## <chr> <chr> <list>
## 1 Mercury <NA> <NULL>
## 2 Venus <NA> <NULL>
## 3 Earth Moon <named list [2]>
## 4 Jupiter Io <named list [2]>
## 5 Jupiter Europa <named list [2]>
## 6 Jupiter Ganymede <named list [2]>
## 7 Jupiter Callisto <named list [2]>
## 8 Jupiter Amalthea <named list [2]>
## 9 Jupiter Himalia <named list [2]>
## 10 Jupiter Elara <named list [2]>
## # … with 164 more rows
- Unnest the moon_data column so that its contents are split over columns.

planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons) %>%
  # Further unnest the moons column
  unnest_wider(moons) %>%
  # Unnest the moon_data column
  unnest_wider(moon_data)
## # A tibble: 174 × 4
## planet moon_name radius density
## <chr> <chr> <dbl> <dbl>
## 1 Mercury <NA> NA NA
## 2 Venus <NA> NA NA
## 3 Earth Moon 1738. 3.34
## 4 Jupiter Io 1822. 3.53
## 5 Jupiter Europa 1561. 3.01
## 6 Jupiter Ganymede 2631. 1.94
## 7 Jupiter Callisto 2410. 1.83
## 8 Jupiter Amalthea 83.4 0.849
## 9 Jupiter Himalia 85 2.6
## 10 Jupiter Elara 43 2.6
## # … with 164 more rows
- Use dplyr's slice_max() function on the moon radius to get a top five of the biggest moons.

planet_df %>%
  # Unnest the moons list column over observations
  unnest_longer(moons) %>%
  # Further unnest the moons column
  unnest_wider(moons) %>%
  # Unnest the moon_data column
  unnest_wider(moon_data) %>%
  # Get the top five largest moons by radius
  slice_max(radius, n = 5)
## # A tibble: 5 × 4
## planet moon_name radius density
## <chr> <chr> <dbl> <dbl>
## 1 Jupiter Ganymede 2631. 1.94
## 2 Saturn Titan 2575. 1.88
## 3 Jupiter Callisto 2410. 1.83
## 4 Jupiter Io 1822. 3.53
## 5 Earth Moon 1738. 3.34
Well done! Jupiter has three moons in the top five with Ganymede being
the largest overall. We also have Saturn’s Titan and our very own Moon.
The more deeply nested your data is, the more iterations of unnest_longer()
and unnest_wider()
you’ll need to fully unnest it.
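To see why deeper nesting needs more unnesting passes, here is a base-R sketch of the same three steps on a tiny hand-built stand-in for the devstronomy data (the planet and moon values below are hypothetical, not the course dataset):

```r
# Two planets, each with a list of moons; each moon is a named list whose
# moon_data element is itself a named list -- two levels of nesting
planet_df <- data.frame(planet = c("Earth", "Mars"))
planet_df$moons <- list(
  list(list(moon_name = "Moon", moon_data = list(radius = 1738, density = 3.34))),
  list(
    list(moon_name = "Phobos", moon_data = list(radius = 11.3, density = 1.87)),
    list(moon_name = "Deimos", moon_data = list(radius = 6.2, density = 1.47))
  )
)

moons_df <- do.call(rbind, lapply(seq_len(nrow(planet_df)), function(i) {
  # unnest_longer(moons): one output row per element of the planet's moons list
  do.call(rbind, lapply(planet_df$moons[[i]], function(m) {
    # unnest_wider(moons), then unnest_wider(moon_data):
    # named fields become their own columns
    data.frame(planet = planet_df$planet[i],
               moon_name = m$moon_name,
               radius = m$moon_data$radius,
               density = m$moon_data$density)
  }))
}))
moons_df
```

Each level of the list structure costs one unnesting step, which is exactly what the unnest_longer()/unnest_wider() pipeline above spells out declaratively.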
4.3 Selecting nested variables
4.3.1 Hoisting Star Wars films
You’ve been given a nested data set on Star Wars characters (character_df
) and want to explore the films in which they appeared.
You’ll first use the unnest_wider()
and unnest_longer()
functions to explore the data and will then switch to hoist()
to select a specific element in the nested data structure directly.
- Inspect character_df in the console.
- Unnest the metadata column.

character_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/star_wars_characters.rds")

character_df %>%
  # Unnest the metadata column
  unnest_wider(metadata)
## # A tibble: 82 × 4
## name species homeworld films
## <chr> <chr> <chr> <list>
## 1 Ackbar Mon Calamari Mon Cala <chr [2]>
## 2 Adi Gallia Tholothian Coruscant <chr [2]>
## 3 Anakin Skywalker Human Tatooine <chr [3]>
## 4 Arvel Crynyd Human unknown <chr [1]>
## 5 Ayla Secura Twi'lek Ryloth <chr [3]>
## 6 Bail Prestor Organa Human Alderaan <chr [2]>
## 7 Barriss Offee Mirialan Mirial <chr [1]>
## 8 BB8 Droid unknown <chr [1]>
## 9 Ben Quadinaros Toong Tund <chr [1]>
## 10 Beru Whitesun lars Human Tatooine <chr [3]>
## # … with 72 more rows
- Unnest the films column.

character_df %>%
  # Unnest the metadata column
  unnest_wider(metadata) %>%
  # Unnest the films column
  unnest_longer(films)
## # A tibble: 166 × 4
## name species homeworld films
## <chr> <chr> <chr> <chr>
## 1 Ackbar Mon Calamari Mon Cala Return of the Jedi
## 2 Ackbar Mon Calamari Mon Cala The Force Awakens
## 3 Adi Gallia Tholothian Coruscant The Phantom Menace
## 4 Adi Gallia Tholothian Coruscant Revenge of the Sith
## 5 Anakin Skywalker Human Tatooine Attack of the Clones
## 6 Anakin Skywalker Human Tatooine The Phantom Menace
## 7 Anakin Skywalker Human Tatooine Revenge of the Sith
## 8 Arvel Crynyd Human unknown Return of the Jedi
## 9 Ayla Secura Twi'lek Ryloth Attack of the Clones
## 10 Ayla Secura Twi'lek Ryloth The Phantom Menace
## # … with 156 more rows
- As an alternative approach, use hoist() to select the first film from the films list nested in the metadata column.

character_df %>%
  hoist(metadata, first_film = list("films", 1))
## # A tibble: 82 × 3
## name first_film metadata
## <chr> <chr> <list>
## 1 Ackbar Return of the Jedi <named list [3]>
## 2 Adi Gallia The Phantom Menace <named list [3]>
## 3 Anakin Skywalker Attack of the Clones <named list [3]>
## 4 Arvel Crynyd Return of the Jedi <named list [3]>
## 5 Ayla Secura Attack of the Clones <named list [3]>
## 6 Bail Prestor Organa Attack of the Clones <named list [3]>
## 7 Barriss Offee Attack of the Clones <named list [3]>
## 8 BB8 The Force Awakens <named list [3]>
## 9 Ben Quadinaros The Phantom Menace <named list [3]>
## 10 Beru Whitesun lars Attack of the Clones <named list [3]>
## # … with 72 more rows
Nice work! You dug into multiple layers of the nested structure to fetch the data you wanted.
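The path list("films", 1) passed to hoist() is just a recipe for walking the nested list: metadata, then the "films" element, then its first entry. A base-R sketch on a hypothetical two-row version of character_df (values invented for illustration) makes that walk explicit:

```r
# Hypothetical mini character_df with a metadata list column
character_df <- data.frame(name = c("Ackbar", "Adi Gallia"))
character_df$metadata <- list(
  list(species = "Mon Calamari",
       films = list("Return of the Jedi", "The Force Awakens")),
  list(species = "Tholothian",
       films = list("The Phantom Menace", "Revenge of the Sith"))
)

# Base-R equivalent of hoist(metadata, first_film = list("films", 1)):
# for each row, step into metadata -> "films" -> element 1
character_df$first_film <- vapply(
  character_df$metadata,
  function(m) m[["films"]][[1]],
  character(1)
)
character_df[, c("name", "first_film")]
```

The difference is that hoist() also removes the plucked element from the remaining nested structure, whereas this sketch leaves metadata untouched.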
4.3.2 Hoisting movie ratings
You’ve written a script to scrape data on your favorite movies from the Open Movie DataBase API. Now you want to process the JSON data to extract the Rotten Tomatoes rating for each movie. You’ve been given a data frame named movie_df
which holds the JSON responses for five movies. You’ll explore this data with unnest_wider()
and unnest_longer()
before switching to hoist()
.
The dplyr
package has been pre-loaded for you.
- Unnest the movie column.

# movie_df %>%
#   # Unnest the movie column
#   unnest_wider(movie)
- Unnest the Ratings column.

# movie_df %>%
#   # Unnest the movie column
#   unnest_wider(movie) %>%
#   select(Title, Year, Ratings) %>%
#   # Unnest the Ratings column
#   unnest_wider(Ratings)
- Use hoist() on the movie column to extract the Title, Year, and Rotten Tomatoes rating. Note that this rating is nested inside the Ratings column.

# movie_df %>%
#   hoist(
#     movie,
#     title = "Title",
#     year = "Year",
#     rating = list("Ratings", "Rotten Tomatoes")
#   )
Well done! Most data scientists try to stay away from nested data as much as possible. That said, you now know how to get your hands dirty when you have to.
4.4 Nesting data for modeling
4.4.1 Tidy model outputs with broom
You’re trying to predict a person’s weight based on their waist circumference and stature (height). To do so you’re using the US army body measurement dataset ANSUR II. The model has already been trained for you using this code:
model <- lm(weight_kg ~ waist_circum_m + stature_m, data = ansur_df)
You will use the broom
package’s glance()
and tidy()
functions in the console to inspect model
outputs in a tidy format.
model = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/ansur_weight_model.rds")

library(broom)
tidy(model)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -128. 1.25 -103. 0
## 2 waist_circum_m 99.8 0.615 162. 0
## 3 stature_m 68.1 0.781 87.2 0
glance(model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.894 0.894 5.09 25638. 0 2 -18485. 36977. 37004.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
What is the R² value?

- 0
- 0.894
- 5.09
What is the standard error on the intercept?

- -128.
- 0.615
- 1.25
Correct!
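What tidy() and glance() return is information base R already computes but scatters across summary(); broom just collects it into tidy tibbles. A sketch on the built-in mtcars data (not the ANSUR II data, which ships with the course) shows where those numbers live:

```r
# Fit a small linear model on built-in data
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(model)

# glance(model)$r.squared corresponds to:
r_squared <- s$r.squared

# tidy(model)'s estimate and std.error columns correspond to the first two
# columns of the base-R coefficient table:
coef_table <- coef(s)  # Estimate, Std. Error, t value, Pr(>|t|) per term
r_squared
```

The advantage of broom is that these values arrive as regular data frame columns, so they slot straight into dplyr pipelines instead of living inside an S3 summary object.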
4.4.2 Nesting tibbles
You’re pre-processing the US army body measurement dataset ANSUR II to train multiple models in a single pipeline. You’ll experiment with the nest()
function to create a list column with nested tibbles containing sub-sets of the data.
The dplyr package has been pre-loaded for you.
- Group the data by army branch and then nest it.

# ansur_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/ANSUR_II.rds")
# ansur_df %>%
#   # Group the data by branch, then nest
#   group_by(branch) %>%
#   nest()
What is the shape of the Combat Arms branch nested dataset?

- 1,312 × 5
- 1,582 × 5
- 3,174 × 5
- Group the data by both branch and sex, then nest it.

# ansur_df %>%
#   # Group the data by branch and sex, then nest
#   group_by(branch, sex) %>%
#   nest()
What is the nested tibble shape for females in the Combat Arms branch?

- 43 × 4
- 687 × 4
- 1,539 × 4
Well done! Now that you know how to nest your data and how to use the broom
package to tidy model outputs, it’s time to put things together.
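Conceptually, nest() groups the rows and tucks each group's remaining columns into a list column of sub-tibbles. Base R's split() captures the same idea; here is a sketch on hypothetical mini data (invented values, not ANSUR II):

```r
# Mini stand-in for the body measurement data
df <- data.frame(
  branch = c("Combat Arms", "Combat Arms", "Combat Support"),
  sex = c("Male", "Female", "Male"),
  weight_kg = c(80, 62, 75)
)

# Roughly what group_by(branch) %>% nest() produces: one sub-data frame
# per group, holding the columns that were not grouped on
nested <- split(df[c("sex", "weight_kg")], df$branch)
sapply(nested, nrow)
```

nest() keeps the result as a single tibble with a list column, which is what lets you keep piping into mutate() and map() afterwards.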
4.4.3 Modeling on nested data frames
You’ll be working on the US Army ANSUR II body measurement dataset, which has been pre-loaded as ansur_df
.
The goal is to nest the data for both sexes so that you can
simultaneously train two linear models, one for each sex. These models
will derive a person’s weight from their stature (height) and waist
circumference. You’ll then unnest the data to inspect the model’s
statistics produced by the glance()
function from the broom
package.
The dplyr
, broom
, and purrr
packages have been pre-loaded for you.
Side note: In the provided code, the purrr
package’s map()
function applies functions on each nested data frame. Check out this package if you like using functions inside pipes!
- Group the data by sex.
- Nest the data.
- Unnest the glanced column.
ansur_df = readRDS("/Users/apple/Documents/Rstudio/DataCamp/ReshapingDatawithtidyr/DATABASE/ANSUR_II.rds")

ansur_df %>%
  # Group the data by sex
  group_by(sex) %>%
  # Nest the data
  nest() %>%
  mutate(
    fit = map(data, function(df) lm(weight_kg ~ waist_circum_m + stature_m, data = df)),
    glanced = map(fit, glance)
  ) %>%
  # Unnest the glanced column
  unnest(glanced)
## # A tibble: 2 × 15
## # Groups: sex [2]
## sex data fit r.squared adj.r.squared sigma statistic p.value df
## <chr> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female <tibble> <lm> 0.822 0.822 4.63 4585. 0 2
## 2 Male <tibble> <lm> 0.873 0.873 5.06 14076. 0 2
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>, nobs <int>
Well done! You’ve trained multiple models and created tidy outputs in a single pipeline making it easy to compare them.
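Stripped of the tidyverse machinery, the nest()/map() pipeline boils down to "fit one model per group, then collect one row of statistics per group". The same idea in base R with split() and lapply(), sketched on the built-in mtcars data since ansur_df ships with the course:

```r
# One linear model per transmission type (am = 0/1), analogous to
# group_by(sex) %>% nest() %>% mutate(fit = map(...))
models <- lapply(split(mtcars, mtcars$am), function(df) {
  lm(mpg ~ wt + hp, data = df)
})

# One statistic per group, as unnest(glanced) lays them out row by row
r2_per_group <- sapply(models, function(m) summary(m)$r.squared)
r2_per_group
```

The tidyr version scales more gracefully: glance() returns a dozen statistics at once, and unnest() keeps them aligned with the grouping columns automatically.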
Congratulations! You did it! You’re now able to reshape data in R using tidyr
. Tweet us your feedback and tell us what you think.