Functional Programming with purrr
DataCamp
Course Description
Lists can be difficult to both understand and manipulate, but they can pack a ton of information and are very powerful. In this course, you will learn to easily extract, summarize, and manipulate lists and how to export the data to your desired object, be it another list, a vector, or even something else! Throughout the course, you will work with the purrr package and a variety of datasets from the repurrrsive package, including data from Star Wars and Wes Anderson films and data collected about GitHub users and GitHub repos. Following this course, your list skills will be purrrfect!
1 Simplifying with purrr
Iteration is a powerful way to make the computer do the work for you. It can also be an area of coding where it is easy to make lots of typos and simple mistakes. The purrr package helps simplify iteration so you can focus on the next step, instead of finding typos.
1.1 The power of iteration
1.1.1 Introduction to iteration
Imagine that you need to read in hundreds of files with a similar structure and perform an action on them. You don’t want to write hundreds of repetitive lines of code to read in all the files or to perform the action. Instead, you want to iterate over them. Iteration is the process of applying the same operation to multiple inputs. Being able to iterate makes your code more efficient, and it is especially powerful when working with lists.
For this exercise, the names of 16 CSV files have been loaded into a list called files. In your own work, you could use the list.files() function to create this list. The readr library is also already loaded.
This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the tidyverse Cheat Sheet and keep it handy!
Use a for loop to go through the files list and give each element as an input to readr::read_csv(), which is another way of saying the read_csv() function from the readr package.
# Initialize list
all_files <- list()
Read each file into the all_files list.
# Build the full path to every CSV file in the data folder
files <- list.files("/Users/apple/Documents/Rstudio/DataCamp/FoundationsofFunctionalProgrammingwithpurrr/simulated_data_from_1990_to_2005", pattern = "\\.csv$")
files <- paste("/Users/apple/Documents/Rstudio/DataCamp/FoundationsofFunctionalProgrammingwithpurrr/simulated_data_from_1990_to_2005/", files, sep = "")
# For loop to read files into a list
for(i in seq_along(files)){
  all_files[[i]] <- read_csv(files[[i]])
}
head(all_files)
## [[1]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1990 5.25 197.
## 2 1990 8.17 192.
## 3 1990 6.49 192.
## 4 1990 5.82 195.
## 5 1990 5.54 201.
## 6 1990 6.65 196.
## 7 1990 10.4 208.
## 8 1990 1.66 183.
## 9 1990 2.78 174.
## 10 1990 8.34 198.
## # … with 190 more rows
##
## [[2]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1991 3.70 197.
## 2 1991 5.37 187.
## 3 1991 7.05 186.
## 4 1991 1.97 207.
## 5 1991 8.05 217.
## 6 1991 1.97 213.
## 7 1991 5.33 195.
## 8 1991 4.32 204.
## 9 1991 4.46 177.
## 10 1991 4.63 222.
## # … with 190 more rows
##
## [[3]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1992 8.64 178.
## 2 1992 3.70 207.
## 3 1992 4.79 206.
## 4 1992 9.22 194.
## 5 1992 6.49 202.
## 6 1992 4.58 197.
## 7 1992 5.06 174.
## 8 1992 2.20 216.
## 9 1992 4.72 177.
## 10 1992 10.0 188.
## # … with 190 more rows
##
## [[4]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1993 2.34 204.
## 2 1993 5.44 167.
## 3 1993 6.86 213.
## 4 1993 5.70 197.
## 5 1993 2.78 193.
## 6 1993 3.24 164.
## 7 1993 5.59 234.
## 8 1993 3.02 183.
## 9 1993 4.60 182.
## 10 1993 7.56 205.
## # … with 190 more rows
##
## [[5]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1994 3.40 197.
## 2 1994 4.29 214.
## 3 1994 6.91 175.
## 4 1994 3.11 181.
## 5 1994 5.50 185.
## 6 1994 3.59 211.
## 7 1994 2.97 189.
## 8 1994 7.40 171.
## 9 1994 9.66 198.
## 10 1994 8.19 221.
## # … with 190 more rows
##
## [[6]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1995 5.12 197.
## 2 1995 4.18 219.
## 3 1995 3.70 186.
## 4 1995 4.46 204.
## 5 1995 7.48 209.
## 6 1995 8.38 204.
## 7 1995 4.51 202.
## 8 1995 5.68 208.
## 9 1995 5.24 211.
## 10 1995 3.04 212.
## # … with 190 more rows
map(all_files,~head(.x))
## [[1]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1990 5.25 197.
## 2 1990 8.17 192.
## 3 1990 6.49 192.
## 4 1990 5.82 195.
## 5 1990 5.54 201.
## 6 1990 6.65 196.
##
## [[2]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1991 3.70 197.
## 2 1991 5.37 187.
## 3 1991 7.05 186.
## 4 1991 1.97 207.
## 5 1991 8.05 217.
## 6 1991 1.97 213.
##
## [[3]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1992 8.64 178.
## 2 1992 3.70 207.
## 3 1992 4.79 206.
## 4 1992 9.22 194.
## 5 1992 6.49 202.
## 6 1992 4.58 197.
##
## [[4]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1993 2.34 204.
## 2 1993 5.44 167.
## 3 1993 6.86 213.
## 4 1993 5.70 197.
## 5 1993 2.78 193.
## 6 1993 3.24 164.
##
## [[5]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1994 3.40 197.
## 2 1994 4.29 214.
## 3 1994 6.91 175.
## 4 1994 3.11 181.
## 5 1994 5.50 185.
## 6 1994 3.59 211.
##
## [[6]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1995 5.12 197.
## 2 1995 4.18 219.
## 3 1995 3.70 186.
## 4 1995 4.46 204.
## 5 1995 7.48 209.
## 6 1995 8.38 204.
##
## [[7]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1996 7.90 185.
## 2 1996 10.2 178.
## 3 1996 7.28 210.
## 4 1996 5.51 189.
## 5 1996 4.47 209.
## 6 1996 7.29 207.
##
## [[8]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1997 2.52 225.
## 2 1997 4.85 194.
## 3 1997 1.47 211.
## 4 1997 3.28 184.
## 5 1997 2.11 187.
## 6 1997 5.51 198.
##
## [[9]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1998 5.26 190.
## 2 1998 2.84 184.
## 3 1998 4.81 238.
## 4 1998 5.79 201.
## 5 1998 5.97 196.
## 6 1998 7.01 180.
##
## [[10]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1999 3.71 188.
## 2 1999 4.37 216.
## 3 1999 2.78 157.
## 4 1999 9.02 192.
## 5 1999 4.11 204.
## 6 1999 6.34 204.
##
## [[11]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 2000 5.57 196.
## 2 2000 3.40 202.
## 3 2000 10.5 196.
## 4 2000 2.73 196.
## 5 2000 -0.410 189.
## 6 2000 2.61 218.
##
## [[12]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 2001 5.33 213.
## 2 2001 2.27 201.
## 3 2001 3.23 200.
## 4 2001 6.00 191.
## 5 2001 6.41 194.
## 6 2001 3.11 223.
##
## [[13]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 2002 6.63 188.
## 2 2002 -0.778 216.
## 3 2002 3.16 193.
## 4 2002 7.62 198.
## 5 2002 2.08 209.
## 6 2002 5.14 212.
##
## [[14]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 2003 5.59 173.
## 2 2003 4.58 207.
## 3 2003 6.27 201.
## 4 2003 -1.74 195.
## 5 2003 6.54 182.
## 6 2003 5.15 203.
##
## [[15]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 2004 7.89 222.
## 2 2004 6.05 177.
## 3 2004 3.83 212.
## 4 2004 4.15 198.
## 5 2004 3.02 196.
## 6 2004 2.58 206.
##
## [[16]]
## # A tibble: 6 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 2005 8.73 201.
## 2 2005 3.47 191.
## 3 2005 2.19 194.
## 4 2005 4.39 211.
## 5 2005 6.33 180.
## 6 2005 -1.58 219.
Output the size of the all_files list.
# Output size of list object
length(all_files)
## [1] 16
Good work! Now let’s see how to do it more easily with purrr.
1.1.2 Iteration with purrr
You’ve made a great for loop, but it uses a lot of code to do something as simple as input a series of files into a list. This is where purrr comes in. We can do the same thing as a for loop in one line of code with purrr::map(). The function map() iterates over a list and applies another function, which can be specified with the .f argument.
map() takes two arguments:
- The first is the list that will be iterated over
- The second is a function that will act on each element of the list
The readr library is already loaded.
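As a minimal sketch of those two arguments, the snippet below runs the same operation once with a for loop and once with map(). The list nums and the helper double_it() are made up for illustration and are not part of the course data.

```r
library(purrr)

# Hypothetical inputs, not from the course: a small list and a helper function
nums <- list(1, 2, 3)
double_it <- function(x) x * 2

# for loop version: pre-allocate the output, then fill it element by element
out_loop <- vector("list", length(nums))
for (i in seq_along(nums)) {
  out_loop[[i]] <- double_it(nums[[i]])
}

# map() version: the list comes first, the function (.f) second
out_map <- map(nums, double_it)

# Both approaches build the same list
identical(out_loop, out_map)
```

The map() call needs no index variable and no pre-allocated output, which is where most for loop typos tend to creep in.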
Load the purrr library (note the 3 Rs).
# Load purrr library
library(purrr)
Use map() instead of the for loop. Use the same list files and the same function readr::read_csv().
# Use map to iterate
all_files_purrr <- map(files, read_csv)
head(all_files_purrr)
## [[1]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1990 5.25 197.
## 2 1990 8.17 192.
## 3 1990 6.49 192.
## 4 1990 5.82 195.
## 5 1990 5.54 201.
## 6 1990 6.65 196.
## 7 1990 10.4 208.
## 8 1990 1.66 183.
## 9 1990 2.78 174.
## 10 1990 8.34 198.
## # … with 190 more rows
##
## [[2]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1991 3.70 197.
## 2 1991 5.37 187.
## 3 1991 7.05 186.
## 4 1991 1.97 207.
## 5 1991 8.05 217.
## 6 1991 1.97 213.
## 7 1991 5.33 195.
## 8 1991 4.32 204.
## 9 1991 4.46 177.
## 10 1991 4.63 222.
## # … with 190 more rows
##
## [[3]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1992 8.64 178.
## 2 1992 3.70 207.
## 3 1992 4.79 206.
## 4 1992 9.22 194.
## 5 1992 6.49 202.
## 6 1992 4.58 197.
## 7 1992 5.06 174.
## 8 1992 2.20 216.
## 9 1992 4.72 177.
## 10 1992 10.0 188.
## # … with 190 more rows
##
## [[4]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1993 2.34 204.
## 2 1993 5.44 167.
## 3 1993 6.86 213.
## 4 1993 5.70 197.
## 5 1993 2.78 193.
## 6 1993 3.24 164.
## 7 1993 5.59 234.
## 8 1993 3.02 183.
## 9 1993 4.60 182.
## 10 1993 7.56 205.
## # … with 190 more rows
##
## [[5]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1994 3.40 197.
## 2 1994 4.29 214.
## 3 1994 6.91 175.
## 4 1994 3.11 181.
## 5 1994 5.50 185.
## 6 1994 3.59 211.
## 7 1994 2.97 189.
## 8 1994 7.40 171.
## 9 1994 9.66 198.
## 10 1994 8.19 221.
## # … with 190 more rows
##
## [[6]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1995 5.12 197.
## 2 1995 4.18 219.
## 3 1995 3.70 186.
## 4 1995 4.46 204.
## 5 1995 7.48 209.
## 6 1995 8.38 204.
## 7 1995 4.51 202.
## 8 1995 5.68 208.
## 9 1995 5.24 211.
## 10 1995 3.04 212.
## # … with 190 more rows
Output the size of all_files_purrr.
# Output size of list object
length(all_files_purrr)
## [1] 16
Nice! You can see from the output here that 16 different files have been read into all_files_purrr.
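The same pattern can be tried without the course files. The sketch below is one possible self-contained version: it writes three small CSV files to a temporary folder (the folder name, file names, and contents are invented for illustration) and reads them back with map(). Base R's read.csv() stands in for readr::read_csv() so that only purrr is required.

```r
library(purrr)

# Hypothetical setup: write three small CSV files into a temporary folder
csv_dir <- file.path(tempdir(), "purrr_demo_csv")
dir.create(csv_dir, showWarnings = FALSE)
for (yr in 1990:1992) {
  write.csv(data.frame(years = yr, a = 1:3),
            file.path(csv_dir, paste0("data_", yr, ".csv")),
            row.names = FALSE)
}

# full.names = TRUE returns complete paths, so no paste() step is needed;
# the pattern is a regular expression matching names that end in ".csv"
demo_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# One line replaces the whole for loop
demo_data <- map(demo_files, read.csv)

length(demo_data)
```

Using full.names = TRUE is a design choice that avoids pasting the folder path onto each file name by hand.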
1.1.3 More iteration with for loops
Iteration isn’t just for reading in files though; iteration can be used to perform other actions on objects. First, you will try iterating with a for loop.
You’re going to change each element of a list into a numeric data type and then put it back into the same element in the same list.
For this exercise, you will iterate using a for loop that takes list_of_df, which is a list of character vectors, but the characters are actually numbers! You need to change the character vectors to numeric so that you can perform mathematical operations on them; you can use the base R function as.numeric() to do that.
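Check the class type of the first element of list_of_df.
# Stand-in for the course data: list_of_df is rebuilt here as ten integer vectors
list_of_df <- lapply(1:10, function(x) 1:4)
# Check the class type of the first element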
class(list_of_df[[1]])
## [1] "integer"
Build a for loop that takes each element of list_of_df, changes it into numeric data with as.numeric(), and adds it back into the same element of list_of_df.
# Change each element from a character to a number
for(i in seq_along(list_of_df)){
  list_of_df[[i]] <- as.numeric(list_of_df[[i]])
}
Check the class type of the first element of list_of_df.
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
Print out list_of_df.
# Print out the list
head(list_of_df)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] 1 2 3 4
##
## [[4]]
## [1] 1 2 3 4
##
## [[5]]
## [1] 1 2 3 4
##
## [[6]]
## [1] 1 2 3 4
Nice! You can see from the output that we have a list of numbers now!
1.1.4 More iteration with purrr
Now you will change each element of a list into a numeric data type and then put it back into the same element in the same list, but instead of using a for loop, you’ll use map().
You can use the purrr function map() to more easily loop over a list and turn the characters into numbers. Instead of having to build a whole for loop, you can use one line of code.
Check the class type of the first element of list_of_df.
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
Use map() to iterate over list_of_df and change each element of the list into numeric data.
# Change each character element to a number
list_of_df <- map(list_of_df, as.numeric)
Check the class type of the first element of list_of_df again.
# Check the class type of the first element again
class(list_of_df[[1]])
## [1] "numeric"
Print out list_of_df.
# Print out the list
head(list_of_df)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] 1 2 3 4
##
## [[4]]
## [1] 1 2 3 4
##
## [[5]]
## [1] 1 2 3 4
##
## [[6]]
## [1] 1 2 3 4
Good work! Now you can fix class type issues in your lists!
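To see the conversion change something visible, here is a small sketch that starts from genuine character vectors. The list chr_list is hypothetical and stands in for data where numbers arrived as text.

```r
library(purrr)

# Hypothetical list of character vectors whose contents are really numbers
chr_list <- list(c("1", "2", "3"), c("4.5", "6"))

# Before: arithmetic on these elements would fail because they are text
class_before <- class(chr_list[[1]])

# Convert every element with one call
num_list <- map(chr_list, as.numeric)

# After: each element is numeric, so mathematical operations work
class_after <- class(num_list[[1]])
total <- sum(num_list[[2]])

class_before
class_after
total
```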
1.2 Subsetting lists
1.2.1 Subsetting lists
Often when working in R, you’ll use dataframes or vectors. Another kind of R object is a list. While lists can be complicated, lists are also incredibly powerful. Lists are like Hermione Granger’s bag of holding (from Harry Potter); they can hold a wide variety of things. The contents of a list don’t have to be the same data type, and as long as you know how it’s organized, you can grab out what you need by subsetting.
Both named and unnamed lists can be subset using double square brackets [[ ]], like this: listname[[ index ]]
If a list is named, you can also use $ for subsetting. The syntax list$elementname pulls out the named element from the list. As with any other kind of object in R, you can use str() to determine the structure of the list.
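A short sketch of these subsetting options, using a made-up list called bag (not one of the course datasets). It also shows single brackets for contrast, which return a smaller list rather than the element itself.

```r
# Hypothetical named list mixing different data types
bag <- list(numbers = 1:3, label = "wand", flags = c(TRUE, FALSE))

by_position <- bag[[1]]     # double brackets return the element itself
by_name     <- bag$numbers  # $ works because the list is named
sub_list    <- bag[1]       # single brackets return a list of length 1

# Position and name point at the same element here
identical(by_position, by_name)

# str() gives a compact overview of the whole list
str(bag)
```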
Load the repurrrsive package.
# Load repurrrsive package, to get access to the wesanderson dataset
library(repurrrsive)
Load the wesanderson dataset.
# Load wesanderson dataset
data(wesanderson)
Examine the structure of the first element in wesanderson.
# Get structure of first element in wesanderson
str(wesanderson[[1]])
## chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
Examine the structure of the GrandBudapest element in wesanderson.
# Get structure of GrandBudapest element in wesanderson
str(wesanderson$GrandBudapest)
## chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
Good work! Now you can subset and determine the structure of each part of a named or unnamed list!
1.2.2 Subsetting list elements
You can also subset within list elements using bracket notation like this: ListName$ElementName[VectorNumber]. If a list element is a dataframe, you can pull out a column like this: ListName$ElementName$ColumnName or ListName[[1]][,1].
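The notation above can be sketched on a small made-up list. The name palette_info and the contents of its films dataframe are assumptions for illustration; only the hex codes are borrowed from the GrandBudapest output shown earlier.

```r
# Hypothetical list holding a character vector and a dataframe
palette_info <- list(
  colors = c("#F1BB7B", "#FD6467", "#5B1A18"),
  films  = data.frame(title = c("A", "B"), year = c(1996, 2014))
)

# ListName$ElementName[VectorNumber]: third value of the colors vector
third_color <- palette_info$colors[3]

# ListName$ElementName$ColumnName: a dataframe column by name
years_by_name <- palette_info$films$year

# ListName[[2]][, 2]: the same column, reached by position
years_by_position <- palette_info[[2]][, 2]

identical(years_by_name, years_by_position)
```

Name-based access tends to survive reordering of the list, while position-based access is shorter but breaks if elements move.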
In this exercise, you’ll examine the wesanderson and sw_films datasets from the repurrrsive package. wesanderson contains color palettes for each of Wes Anderson’s movies. These colors are recorded as hexadecimal codes, that is, a # followed by six hexadecimal digits that indicate a particular color. Here, you will use two ways of pulling out a particular color code.
sw_films contains information about the films in the Star Wars franchise, such as title, director, producer, etc. You’ll use subsetting to explore this dataset.
Subset the third color from the first element of wesanderson. Then subset the fourth color from GrandBudapest.
# Third element of the first wesanderson vector
wesanderson[[1]][3]
## [1] "#5B1A18"
# Fourth element of the GrandBudapest wesanderson vector
wesanderson$GrandBudapest[4]
## [1] "#D67236"
Subset the first element from sw_films. Then subset the title element from the first element.
# Subset the first element of the sw_films data
sw_films[[1]]
## $title
## [1] "A New Hope"
##
## $episode_id
## [1] 4
##
## $opening_crawl
## [1] "It is a period of civil war.\r\nRebel spaceships, striking\r\nfrom a hidden base, have won\r\ntheir first victory against\r\nthe evil Galactic Empire.\r\n\r\nDuring the battle, Rebel\r\nspies managed to steal secret\r\nplans to the Empire's\r\nultimate weapon, the DEATH\r\nSTAR, an armored space\r\nstation with enough power\r\nto destroy an entire planet.\r\n\r\nPursued by the Empire's\r\nsinister agents, Princess\r\nLeia races home aboard her\r\nstarship, custodian of the\r\nstolen plans that can save her\r\npeople and restore\r\nfreedom to the galaxy...."
##
## $director
## [1] "George Lucas"
##
## $producer
## [1] "Gary Kurtz, Rick McCallum"
##
## $release_date
## [1] "1977-05-25"
##
## $characters
## [1] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/2/"
## [3] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/4/"
## [5] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/6/"
## [7] "http://swapi.co/api/people/7/" "http://swapi.co/api/people/8/"
## [9] "http://swapi.co/api/people/9/" "http://swapi.co/api/people/10/"
## [11] "http://swapi.co/api/people/12/" "http://swapi.co/api/people/13/"
## [13] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/15/"
## [15] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
## [17] "http://swapi.co/api/people/19/" "http://swapi.co/api/people/81/"
##
## $planets
## [1] "http://swapi.co/api/planets/2/" "http://swapi.co/api/planets/3/"
## [3] "http://swapi.co/api/planets/1/"
##
## $starships
## [1] "http://swapi.co/api/starships/2/" "http://swapi.co/api/starships/3/"
## [3] "http://swapi.co/api/starships/5/" "http://swapi.co/api/starships/9/"
## [5] "http://swapi.co/api/starships/10/" "http://swapi.co/api/starships/11/"
## [7] "http://swapi.co/api/starships/12/" "http://swapi.co/api/starships/13/"
##
## $vehicles
## [1] "http://swapi.co/api/vehicles/4/" "http://swapi.co/api/vehicles/6/"
## [3] "http://swapi.co/api/vehicles/7/" "http://swapi.co/api/vehicles/8/"
##
## $species
## [1] "http://swapi.co/api/species/5/" "http://swapi.co/api/species/3/"
## [3] "http://swapi.co/api/species/2/" "http://swapi.co/api/species/1/"
## [5] "http://swapi.co/api/species/4/"
##
## $created
## [1] "2014-12-10T14:23:31.880000Z"
##
## $edited
## [1] "2015-04-11T09:46:52.774897Z"
##
## $url
## [1] "http://swapi.co/api/films/1/"
# Subset the title element from the first element of the sw_films data
sw_films[[1]]$title
## [1] "A New Hope"
Great work, now you should be very comfortable subsetting lists!
1.3 The many flavors of map()
1.3.1 map() argument alternatives
You can also use iteration to answer a question, like how long each element in the wesanderson dataset is. You can do this by feeding map() a function like length(). The map(list, function) syntax works just fine for this. However, as future exercises get more complex, you will need to learn a second way of doing this:
map(list, ~function(.x))
This second way gives the same result as map(list, function). To specify how the list is used in the function, use the argument .x to denote where the list element goes inside the function. When you want to use .x to show where the element goes in the function, you need to put a ~ in front of the function in the second argument of map().
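The two styles can be compared side by side in a small sketch. The list words is hypothetical; the third call spells the shorthand out as an ordinary anonymous function, which is roughly what the ~ notation is short for.

```r
library(purrr)

# Hypothetical list of character vectors
words <- list(a = c("x", "y"), b = c("p", "q", "r"))

# 1. Pass the function by name
len_named <- map(words, length)

# 2. Formula shorthand: ~ starts the function body, .x is the current element
len_formula <- map(words, ~ length(.x))

# 3. The same thing written as an explicit anonymous function
len_anon <- map(words, function(el) length(el))

# The shorthand lets .x go anywhere, for example as the second argument
pasted <- map(words, ~ paste("item", .x))

identical(len_named, len_formula)
```

The shorthand pays off once the element is not the first argument of the function, or when several calls are chained on it.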
Use map() on wesanderson and determine the length of each element in the “old” way.
# Map over wesanderson to get the length of each element
map(wesanderson, length)
## $GrandBudapest
## [1] 4
##
## $Moonrise1
## [1] 4
##
## $Royal1
## [1] 4
##
## $Moonrise2
## [1] 4
##
## $Cavalcanti
## [1] 5
##
## $Royal2
## [1] 5
##
## $GrandBudapest2
## [1] 4
##
## $Moonrise3
## [1] 5
##
## $Chevalier
## [1] 4
##
## $Zissou
## [1] 5
##
## $FantasticFox
## [1] 5
##
## $Darjeeling
## [1] 5
##
## $Rushmore
## [1] 5
##
## $BottleRocket
## [1] 7
##
## $Darjeeling2
## [1] 5
Use map() on wesanderson and determine the length of each element again, but this time using map(list, ~function(.x)).
# Map over wesanderson, and determine the length of each element
map(wesanderson, ~length(.x))
## $GrandBudapest
## [1] 4
##
## $Moonrise1
## [1] 4
##
## $Royal1
## [1] 4
##
## $Moonrise2
## [1] 4
##
## $Cavalcanti
## [1] 5
##
## $Royal2
## [1] 5
##
## $GrandBudapest2
## [1] 4
##
## $Moonrise3
## [1] 5
##
## $Chevalier
## [1] 4
##
## $Zissou
## [1] 5
##
## $FantasticFox
## [1] 5
##
## $Darjeeling
## [1] 5
##
## $Rushmore
## [1] 5
##
## $BottleRocket
## [1] 7
##
## $Darjeeling2
## [1] 5
Great job! This new way of writing map_*() functions will come in handy in future exercises, so make a mental note of the ~ and the .x argument.
1.3.2 map_*
The map() function will return its output as a list. However, there are several different map() functions; you can use the map_*() functions to tell purrr the type of output you want. The * in map_*() represents different R data types. For instance, you might want the output to be a vector of numbers so that you can put it inside a dataframe. So, unless you want something to be returned as a list, you need to determine what you want the output to be before you write your map() function.
- Determine the length of each element of the wesanderson dataset using our original map() function. Examine the output.
# Map over wesanderson, to determine the length of each element
map(wesanderson, length)
## $GrandBudapest
## [1] 4
##
## $Moonrise1
## [1] 4
##
## $Royal1
## [1] 4
##
## $Moonrise2
## [1] 4
##
## $Cavalcanti
## [1] 5
##
## $Royal2
## [1] 5
##
## $GrandBudapest2
## [1] 4
##
## $Moonrise3
## [1] 5
##
## $Chevalier
## [1] 4
##
## $Zissou
## [1] 5
##
## $FantasticFox
## [1] 5
##
## $Darjeeling
## [1] 5
##
## $Rushmore
## [1] 5
##
## $BottleRocket
## [1] 7
##
## $Darjeeling2
## [1] 5
- Create a dataframe that has the number of colors from each movie, using map_dbl(). The dbl means a double, or a number that can have a decimal.
# Create a numcolors column and fill with length of each wesanderson element
data.frame(numcolors = map_dbl(wesanderson, ~length(.x)))
## numcolors
## GrandBudapest 4
## Moonrise1 4
## Royal1 4
## Moonrise2 4
## Cavalcanti 5
## Royal2 5
## GrandBudapest2 4
## Moonrise3 5
## Chevalier 4
## Zissou 5
## FantasticFox 5
## Darjeeling 5
## Rushmore 5
## BottleRocket 7
## Darjeeling2 5
Good work! Notice how much cleaner the output was using map_dbl()! It’s always worth thinking through which map_*() function will get you where you need to go before coding it out. In our next chapter, we’ll dive into more complex uses of purrr.
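As a rough illustration of how the suffix controls the output type, the sketch below runs several map_*() variants over a made-up list called scores and keeps the results for inspection with typeof().

```r
library(purrr)

# Hypothetical list of numeric vectors
scores <- list(first = c(1, 2), second = c(3, 4, 5))

as_list <- map(scores, length)                            # list
as_int  <- map_int(scores, length)                        # integer vector
as_dbl  <- map_dbl(scores, mean)                          # double vector
as_chr  <- map_chr(scores, ~ paste(.x, collapse = "-"))   # character vector
as_lgl  <- map_lgl(scores, ~ length(.x) > 2)              # logical vector

# Names are carried over from the input list
typeof(as_int)
typeof(as_dbl)
as_chr
as_lgl
```

Each typed variant returns an atomic vector with one value per list element, which is why it slots directly into a dataframe column.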
2 More complex iterations
purrr is much more than a for loop: it works well with pipes, and we can use it to run models, simulate data, and make nested loops!
2.1 Working with unnamed lists
2.1.1 Names & pipe refresher
It is easy to determine if a list has names using names(). Understanding the named elements of a list can make working with the list elements easier because you can pull out the information you need by name, instead of searching for the correct numbered element.
purrr is a part of the tidyverse, a system of packages designed to be used together, and used with pipes. Let’s do a quick refresh on how pipes work. A pipe %>% takes the output from the function that comes before it, and feeds it into the function that comes after the pipe as its first argument.
function_before() %>%
function_after()
You don’t need to use pipes when you use purrr functions, but for the purposes of these lessons, you will be.
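A tiny sketch of that rule, assuming a made-up numeric vector called values: the piped version and the nested version compute the same thing, but the pipe reads left to right.

```r
library(purrr)  # purrr re-exports the %>% pipe

# Hypothetical input vector
values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# Piped calls are read left to right: values goes into sqrt(), the result into sum()
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```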
- Check to see if the sw_films list has named elements with pipes.
# Use pipes to check for names in sw_films
sw_films %>%
  names()
## NULL
Good work! Now that you know how to check to see if a list has names in a tidy way, you’re ready to dive in.
2.1.2 Setting names
If you have an unnamed list, you can, of course, name each element. This can be very useful for being able to call out certain elements in a list, regardless of their order, especially if you are working with a list that may grow or change over time, or if you use the same code on several different lists. For instance, if you have a list that contains a dataframe, a model, and a plot, being able to call out $plot is much easier than searching to figure out which numbered element holds the plot.
- Name each element of the sw_films list and assign the result to a new list, sw_films_named.
- Iterate over the title element.
# Set names so each element of the list is named for the film title
sw_films_named <- sw_films %>%
  set_names(map_chr(sw_films, "title"))
# Check to see if the names worked/are correct
names(sw_films_named)
## [1] "A New Hope" "Attack of the Clones"
## [3] "The Phantom Menace" "Revenge of the Sith"
## [5] "Return of the Jedi" "The Empire Strikes Back"
## [7] "The Force Awakens"
Good work! Naming lists makes working in purrr easier and more human-readable.
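The naming step can be sketched on a small invented list. The list records and its fields are hypothetical; the point is that map_chr() with a string pulls out one field per element, and set_names() attaches those values as names.

```r
library(purrr)

# Hypothetical unnamed list: each element is a record with a "title" field
records <- list(
  list(title = "alpha", pages = 10),
  list(title = "beta",  pages = 20)
)

# No names yet, so elements can only be reached by position
names_before <- names(records)

# Pull out each title and use it as that element's name
records_named <- records %>%
  set_names(map_chr(records, "title"))

# Elements can now be reached by name, regardless of order
records_named$beta$pages
```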
2.1.3 Pipes in map()
So you’ve refreshed your memory on how pipes can be used between functions. You can also use pipes inside a map() call to help you iterate a pipeline of tasks over a list of inputs.
Here, instead of using one of the repurrrsive datasets, you will be working with a list of numbers so that you can do a few mathematical operations.
# Create a list of values from 1 through 10
numlist <- list(1,2,3,4,5,6,7,8,9,10)
Create a map() function that takes the sqrt() of each element, and then the sin() of the result.
# Iterate over the numlist
map(numlist, ~.x %>% sqrt() %>% sin()) %>% head()
## [[1]]
## [1] 0.841471
##
## [[2]]
## [1] 0.9877659
##
## [[3]]
## [1] 0.9870266
##
## [[4]]
## [1] 0.9092974
##
## [[5]]
## [1] 0.7867491
##
## [[6]]
## [1] 0.6381576
Good work! Using pipes inside of map() makes iterating over multiple functions easy.
2.2 More map()
2.2.1 Simulating Data with Purrr
Often when trying to solve a problem with data we first need to build some simulated data to see if our idea is even possible. For example, you may want to test models with data that have known differences, to see if the models are working correctly.
In this exercise, you will see how this works in purrr by simulating data for two populations, a and b, from the sites: “north”, “east”, and “west”. The two populations will be randomly drawn from a normal distribution, with different means and standard deviations.
# List of sites north, east, and west
sites <- list("north","east","west")
Use map() to create a list of dataframes with three columns. The first column is sites.
- The second is population a, which has a mean of 5, a sample size n of 200, and an sd of (5/2).
- The third is population b, which has a mean of 200, a sample size n of 200, and an sd of 15.
# Create a list of dataframes, each with a sites, a, and b column
list_of_df <- map(sites,
  ~data.frame(sites = .x,
       a = rnorm(mean = 5, n = 200, sd = (5/2)),
       b = rnorm(mean = 200, n = 200, sd = 15)))
map(list_of_df,~head(.x))
## [[1]]
## sites a b
## 1 north 4.614419 214.1417
## 2 north 4.362532 190.2344
## 3 north 1.858374 204.0246
## 4 north 6.746475 206.2674
## 5 north 0.405748 179.3074
## 6 north 7.383507 192.7544
##
## [[2]]
## sites a b
## 1 east 5.230285 188.8726
## 2 east 1.473186 179.3934
## 3 east -1.308601 203.1301
## 4 east 3.674980 215.1431
## 5 east 4.948778 209.1833
## 6 east 2.825842 214.3016
##
## [[3]]
## sites a b
## 1 west 1.982326 214.8582
## 2 west 3.490015 198.0501
## 3 west 5.558575 189.6605
## 4 west 1.867846 195.0652
## 5 west 2.367538 187.1540
## 6 west 3.542964 200.7730
Good work! Now you can simulate data with ease.
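Because rnorm() draws random numbers, the values printed above will differ from run to run. One way to check a simulation like this, sketched below under the assumption that a fixed seed is acceptable, is to call set.seed() first and then summarise each simulated dataframe. The names sim_sites and sim_data are made up so they do not overwrite the exercise objects.

```r
library(purrr)

set.seed(42)  # assumption: any fixed seed makes the random draws repeatable

sim_sites <- list("north", "east", "west")
sim_data <- map(sim_sites,
                ~ data.frame(sites = .x,
                             a = rnorm(n = 200, mean = 5, sd = 5 / 2),
                             b = rnorm(n = 200, mean = 200, sd = 15)))

# One sample mean per site; these should land near the requested 5 and 200
mean_a <- map_dbl(sim_data, ~ mean(.x$a))
mean_b <- map_dbl(sim_data, ~ mean(.x$b))

mean_a
mean_b
```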
2.2.2 Run a linear model
You can use map() to do more than just take the square root of a number or simulate data. You can also use map() to loop over different inputs to run several models, each using the unique values of a given list element. You can also then iterate over the models you’ve run to create the model summaries and look at the results.
The lists sites and list_of_df are preloaded.
- Pipe list_of_df into map() along with the lm() linear model function, to compare a as the response and b as the predictor variable.
  - Use the syntax: lm(response ~ predictor, data = )
- Then pipe the linear model output into map() and generate the summary() of each model.
# Map over the models to look at the relationship of a vs b
list_of_df %>%
  map(~ lm(a ~ b, data = .)) %>%
  map(summary)
## [[1]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.660 -1.737 -0.003 1.536 5.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.749413 2.176642 2.641 0.00892 **
## b -0.003124 0.010833 -0.288 0.77338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.428 on 198 degrees of freedom
## Multiple R-squared: 0.0004197, Adjusted R-squared: -0.004629
## F-statistic: 0.08314 on 1 and 198 DF, p-value: 0.7734
##
##
## [[2]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3931 -1.9487 0.2521 1.7298 6.9519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.18230 2.53465 0.861 0.390
## b 0.01415 0.01255 1.127 0.261
##
## Residual standard error: 2.625 on 198 degrees of freedom
## Multiple R-squared: 0.006377, Adjusted R-squared: 0.001359
## F-statistic: 1.271 on 1 and 198 DF, p-value: 0.261
##
##
## [[3]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2321 -1.9745 0.0907 1.9569 7.5872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.45198 2.63640 0.551 0.582
## b 0.01647 0.01308 1.259 0.210
##
## Residual standard error: 2.82 on 198 degrees of freedom
## Multiple R-squared: 0.007939, Adjusted R-squared: 0.002928
## F-statistic: 1.584 on 1 and 198 DF, p-value: 0.2096
Good work! This will make running multiple models and summarizing their results much easier.
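Printing every summary gets long quickly. A possible next step, sketched here on invented data rather than the exercise's list_of_df, is to keep the fitted models in a list and pull out one number per model with map_dbl(). The object names demo_dfs, models, r_squared, and slopes are hypothetical.

```r
library(purrr)

set.seed(1)  # assumption: fixed seed for repeatable draws

# Hypothetical stand-in for list_of_df: three dataframes with unrelated a and b
demo_dfs <- map(1:3, ~ data.frame(a = rnorm(50, mean = 5, sd = 2.5),
                                  b = rnorm(50, mean = 200, sd = 15)))

# Fit one linear model per dataframe
models <- map(demo_dfs, ~ lm(a ~ b, data = .x))

# Extract a single value from each model summary
r_squared <- models %>% map(summary) %>% map_dbl("r.squared")

# Extract the slope for b from each fitted model
slopes <- map_dbl(models, ~ coef(.x)[["b"]])

r_squared
slopes
```

Since a and b are simulated independently, R-squared values close to zero are what you would expect here, much like the summaries above.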
2.2.3 map_chr()
In this exercise, you’ll dive a bit deeper into the different map_*() variants. The map() function always outputs a list. The map_*() variants output other kinds of information. Study the table below and make sure you’re clear on the type of output for each map_*() variant.
map_*() | Output |
---|---|
map_chr() | character vector |
map_lgl() | logical vector [TRUE or FALSE] |
map_int() | integer vector |
map_dbl() | double vector |
- Compare the results of map() and map_chr() for the director named element of sw_films.
# Pull out the director element of sw_films in a list and character vector
map(sw_films, ~.x[["director"]])
## [[1]]
## [1] "George Lucas"
##
## [[2]]
## [1] "George Lucas"
##
## [[3]]
## [1] "George Lucas"
##
## [[4]]
## [1] "George Lucas"
##
## [[5]]
## [1] "Richard Marquand"
##
## [[6]]
## [1] "Irvin Kershner"
##
## [[7]]
## [1] "J. J. Abrams"
map_chr(sw_films, ~.x[["director"]])
## [1] "George Lucas" "George Lucas" "George Lucas" "George Lucas"
## [5] "Richard Marquand" "Irvin Kershner" "J. J. Abrams"
- Compare the map() and map_lgl() outputs on sw_films for director == "George Lucas".
# Compare outputs when checking if director is George Lucas
map(sw_films, ~.x[["director"]] == "George Lucas")
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] FALSE
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] FALSE
map_lgl(sw_films, ~.x[["director"]] == "George Lucas")
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
Good work! Mastering the flavors of map_*() is key for success in purrr.
2.2.4 map_dbl() and map_int()
Some flavors of map_*() are very similar. map_dbl() and map_int() both output numbers. map_int() outputs integer vectors, which hold numbers with no decimals. map_dbl() outputs double vectors, which hold numbers that can have decimals. Take a closer look at how using different map_*() functions affects outputs.
Here is the map_*() table again as a reference.
map_*() | Output |
---|---|
map_chr() | character vector |
map_lgl() | logical vector [TRUE or FALSE] |
map_int() | integer vector |
map_dbl() | double vector |
Compare the map() and map_dbl() outputs for pulling out the episode_id for each element of sw_films.
# Pull out episode_id element as list
map(sw_films, ~.x[["episode_id"]])
## [[1]]
## [1] 4
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 5
##
## [[7]]
## [1] 7
# Pull out episode_id element as double vector
map_dbl(sw_films, ~.x[["episode_id"]])
## [1] 4 2 1 3 6 5 7
Compare the map() and map_int() outputs for pulling out the episode_id for each element of sw_films.
# Pull out episode_id element as a list
map(sw_films, ~.x[["episode_id"]])
## [[1]]
## [1] 4
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 5
##
## [[7]]
## [1] 7
# Pull out episode_id element as integer vector
map_int(sw_films, ~.x[["episode_id"]])
## [1] 4 2 1 3 6 5 7
Good work! Now you can output numbers without decimals!
2.3 map2() and pmap()
2.3.1 Simulating data with multiple inputs using map2()
The map() function is great if you need to iterate over one list. However, you will often need to iterate over two lists at the same time. This is where map2() comes in. While map() takes one list as the .x argument, map2() takes two lists as two arguments: .x and .y.
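As a minimal sketch of how .x and .y line up, the example below pairs two short hypothetical lists (labels and values are illustrative names, not course objects) element by element.

```r
library(purrr)

# Hypothetical inputs, used only for illustration
labels <- list("north", "west", "east")
values <- list(1, 2, 3)

# .x is taken from labels and .y from values, position by position
paired <- map2_chr(labels, values, ~ paste0(.x, "-", .y))
paired
```

Like map(), map2() has typed flavors; map2_chr() is used here so the result is a character vector rather than a list.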
To test out map2(), you are going to create a simple dataset, with one list of numbers and one list of strings. You will put these two lists together and create some simulated data.
Create a list, means, of the values 1 through 3.
# List of 1, 2 and 3
means <- list(1, 2, 3)
Create a sites list with “north”, “west”, and “east”.
# Create sites list
sites <- list("north", "west", "east")
Use map2() over the sites and means lists to create a dataframe with two columns.
- The first column is sites; the second column is generated by rnorm() with mean from the means list.
# Map over two arguments: sites and means
list_of_files_map2 <- map2(sites, means, ~data.frame(sites = .x,
                           a = rnorm(mean = .y, n = 200, sd = (5/2))))
map(list_of_files_map2, ~head(.x))
## [[1]]
## sites a
## 1 north 1.573448
## 2 north -1.371510
## 3 north -2.659407
## 4 north -3.406760
## 5 north 1.178419
## 6 north 4.325915
##
## [[2]]
## sites a
## 1 west -0.3739373
## 2 west 2.0272414
## 3 west 6.0924047
## 4 west 0.2325518
## 5 west 3.3310114
## 6 west 4.0336696
##
## [[3]]
## sites a
## 1 east 0.6694428
## 2 east 4.7842952
## 3 east 1.0683827
## 4 east 2.4413813
## 5 east 1.7298603
## 6 east 1.1281749
Good work! Now you can use two lists together!
2.3.2 Simulating data 3+ inputs with pmap()
What if you need to iterate over three lists? Is there a map3()
? To iterate over more than two lists, whether it’s three, four, or even 20, you’ll need to use pmap()
. However, pmap()
does require us to supply our list arguments a bit differently.
To use pmap(), you first need to create a master list of all the lists you want to iterate over. The master list is the input for pmap(). Instead of using .x or .y, use the list names as the argument names.
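The name matching can be sketched with a small example. The list inputs and its element names site, mean, and scale are hypothetical; the point is only that each element name of the master list is matched to the function argument of the same name.

```r
library(purrr)

# Hypothetical inputs; the element names become the argument names below
inputs <- list(
  site  = list("north", "west", "east"),
  mean  = list(1, 2, 3),
  scale = list(10, 20, 30)
)

# pmap_chr() matches each named list to the function argument of the same name
described <- pmap_chr(inputs, function(site, mean, scale) {
  paste(site, mean * scale)
})
described
```

Because matching is by name, the order of the elements in the master list does not have to follow the order of the function arguments.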
You are going to simulate data one more time, using five lists as inputs, instead of two. Using pmap() gives you complete control over your simulated dataset, and will allow you to use two different means and two different standard deviations along with the different sites.
Create a master list from the sites, means, means2, sigma, and sigma2 lists.
means2 <- list(0.5, 1, 1.5)
sigma2 <- list(0.5, 1, 1.5)
sigma <- list(1, 2, 3)
# Create a master list, a list of lists
pmapinputs <- list(sites = sites, means = means, sigma = sigma,
                   means2 = means2, sigma2 = sigma2)
Use pmap() over the list of lists to create a list of dataframes with three columns; the first column is sites.
- The second column is a, which is rnorm() with mean = means and sd = sigma.
- The third column is b, which is rnorm() with mean = means2 and sd = sigma2.
# Create a master list, a list of lists
pmapinputs <- list(sites = sites, means = means, sigma = sigma,
                   means2 = means2, sigma2 = sigma2)
# Map over the master list
list_of_files_pmap <- pmap(pmapinputs,
  function(sites, means, sigma, means2, sigma2){
    data.frame(sites = sites,
               a = rnorm(mean = means, n = 200, sd = sigma),
               b = rnorm(mean = means2, n = 200, sd = sigma2))})
map(list_of_files_pmap, ~head(.x))
## [[1]]
## sites a b
## 1 north 2.06084408 0.06552980
## 2 north 0.81124501 0.04062135
## 3 north -0.09554456 -0.53030463
## 4 north 1.84663711 -0.08551129
## 5 north 2.28165089 0.29531221
## 6 north 0.90107028 0.47888973
##
## [[2]]
## sites a b
## 1 west 1.01714878 0.74688682
## 2 west 0.06912261 -1.39160128
## 3 west 2.05448629 0.75707038
## 4 west 2.50355372 -0.07958812
## 5 west -0.02759826 1.89485152
## 6 west 1.75296353 1.91032890
##
## [[3]]
## sites a b
## 1 east -0.9149364 -0.7953880
## 2 east 6.9436878 0.7516244
## 3 east 6.7850664 -0.7958258
## 4 east 4.7084648 1.4231632
## 5 east 1.2073573 4.6877686
## 6 east 5.5571927 2.5614575
Good work! With pmap()
you now have all the power in purrr
.
3 Troubleshooting lists with purrr
As with anything in R, knowing how to troubleshoot issues is an important skill. It is particularly important with lists, where finding the problem can be tricky.
3.1 How to purrr safely()
3.1.1 safely() replace with NA
If you map()
over a list, and one of the elements does not
have the right data type, you will not get the output you expect.
Perhaps you are trying to do a mathematical operation on each element,
and it turns out one of the elements is a character - it simply won’t
work.
If you have a very large list, figuring out where things went wrong, and what exactly went wrong can be hard. That is where safely()
comes in; it shows you both your results and where the errors occurred in your map()
call.
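To see what safely() captures when a call really does fail, here is a minimal sketch with a hypothetical three-element list whose second element is a character, so log() raises an error on it.

```r
library(purrr)

# Hypothetical input: the second element is a character, so log() errors on it
inputs <- list(1, "a", 10)

safe_log <- safely(log, otherwise = NA_real_)
out <- transpose(map(inputs, safe_log))

# result holds the values (NA where the call failed)
out[["result"]]

# error is NULL for successful calls and a condition object otherwise
map_lgl(out[["error"]], is.null)
```

The position of the FALSE in the last vector is the position of the problem element, which is what makes this approach useful on long lists.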
- Use safely() with log(). Note that log(-10) does not raise an error; it returns NaN with a warning, so every error element stays NULL. Pipe the output into transpose() to put the results first.
# Map safely over log
a <- list(-10, 1, 10, 0) %>%
  map(safely(log, otherwise = NA_real_)) %>%
  # Transpose the result
  transpose()
## Warning in .f(...): NaNs produced
- Print out a.
# Print the list
a
## $result
## $result[[1]]
## [1] NaN
##
## $result[[2]]
## [1] 0
##
## $result[[3]]
## [1] 2.302585
##
## $result[[4]]
## [1] -Inf
##
##
## $error
## $error[[1]]
## NULL
##
## $error[[2]]
## NULL
##
## $error[[3]]
## NULL
##
## $error[[4]]
## NULL
- Print out the “result” element of a.
# Print the result element in the list
a[["result"]]
## [[1]]
## [1] NaN
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 2.302585
##
## [[4]]
## [1] -Inf
- Print out just the error messages from a.
# Print the error element in the list
a[["error"]]
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
Good work! Now you have the power to start debugging your lists, and you can do it with simple element subsetting.
3.1.2 Convert data to numeric with purrr
In the sw_people
dataset, some of the Star Wars characters
have unknown heights. If you want to do some data exploration and
determine how character height differs depending on their home planet,
you need to write your code so that R understands the difference between
heights and missing values. Currently, the missing values are entered
as “unknown”
, but you would like them as NA
. In this exercise, you will combine map()
and ifelse()
to fix this issue.
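The idea can be tried on a toy input first. The sketch below assumes a hypothetical list named heights that stores numbers as strings, with “unknown” marking a missing value; it uses map_dbl() so that the result is a plain numeric vector.

```r
library(purrr)

# Hypothetical heights, stored as strings with "unknown" for missing values
heights <- list("172", "unknown", "96")

# Replace "unknown" with NA; convert everything else to a number
heights_num <- map_dbl(heights, function(x) {
  ifelse(x == "unknown", NA_real_, as.numeric(x))
})
heights_num
```

Using NA_real_ rather than a bare NA keeps every element a double, which is what map_dbl() expects.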
Load the sw_people dataset.
# Load sw_people data
data(sw_people)
Map over sw_people and pull out “height”.
If the height is “unknown”, replace it with NA; otherwise, convert the value into a number with as.numeric().
# Map over sw_people and pull out the height element
height_cm <- map(sw_people, "height") %>%
  map(function(x){
    ifelse(x == "unknown", NA,
           as.numeric(x))
  })
Good work! Now you can use purrr
for data wrangling to help clean numeric data in lists.
3.1.3 Finding the problem areas
When you are working with a small list, it might not seem like a lot of work to go through things manually and figure out what element has an issue. But if you have a list with hundreds or thousands of elements, you want to automate that process.
Now you’ll look at a situation with a larger list, where you can see how the error message can be useful to check through the entire list for issues.
map() over sw_people and pull out the “height” element.
map() over safely() to convert the heights from centimeters into feet.
Set quiet = FALSE so that errors are printed.
# Map over sw_people and pull out the height element
height_ft <- map(sw_people, "height") %>%
  map(safely(function(x){
    x * 0.0328084
  }, quiet = FALSE)) %>%
  transpose()
Pipe into transpose() to print the results first.
# Print your list, the result element, and the error element
#height_ft
#height_ft[["result"]]
#height_ft[["error"]]
Good work! Now you are ready to troubleshoot lists too large to check by hand.
3.2 Another way to possibly() purrr
3.2.1 Replace safely() with possibly()
Once you have figured out how to solve an issue with safely() (e.g., output an NA in place of an error), swap out safely() with possibly(). possibly() will run through your code and implement your desired changes without printing out the error messages.
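A minimal sketch of the swap, using a hypothetical list whose second element makes log() fail: possibly() returns the fallback value directly, so the output can go straight into a typed map such as map_dbl().

```r
library(purrr)

# Hypothetical input: log() errors on the character element
inputs <- list(1, "a", 10)

# possibly() returns the fallback value directly instead of a result/error pair
logs <- map_dbl(inputs, possibly(log, otherwise = NA_real_))
logs
```

Compared with safely(), there is no error element to inspect afterwards, which is why possibly() fits best once the failure is already understood.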
You’ll now map()
over log()
again, but you will use possibly()
instead of safely()
since you already know how to resolve your errors.
- Create a list with the values -10, 1, 10, and 0.
- map() over this list to take the log() of each element, using possibly().
- Use NA_real_ to fix any elements that are not the right data type.
# Take the log of each element in the list
a <- list(-10, 1, 10, 0) %>%
  map(possibly(function(x){
    log(x)
  }, NA_real_))
## Warning in log(x): NaNs produced
Good work! Now you can solve issues in lists using safely()
, and then continue with your analysis using possibly()
.
3.2.2 Convert values with possibly()
Let’s say you need to convert the Star Wars character heights in sw_people
from centimeters to feet. You already know that some of the heights have missing data, so you will use possibly()
to convert missing values into NA
. Then you will multiply each of the existing values by 0.0328084 to convert them from centimeters into feet.
To get a feel for your data, print out height_cm
in the console to check out the heights in centimeters.
- Pipe the height_cm object into a map_*() function that returns double vectors.
- Convert each element in height_cm into feet (multiply it by 0.0328084).
- Since not all elements are numeric, use possibly() to replace instances that do not work with NA_real_.
# Create a piped workflow that returns double vectors
height_cm %>%
  map_dbl(possibly(function(x){
    # Convert centimeters to feet
    x * 0.0328084
  }, NA_real_))
## [1] 5.643045 5.479003 3.149606 6.627297 4.921260 5.839895 5.413386 3.182415
## [9] 6.003937 5.971129 6.167979 5.905512 7.480315 5.905512 5.675853 5.741470
## [17] 5.577428 5.905512 2.165354 5.577428 6.003937 6.561680 6.233596 5.807087
## [25] 5.741470 5.905512 4.921260 NA 2.887139 5.249344 6.332021 6.266404
## [33] 5.577428 6.430446 7.349082 6.758530 6.003937 4.494751 3.674541 6.003937
## [41] 5.347769 5.741470 5.905512 5.839895 3.083990 4.002625 5.347769 6.167979
## [49] 6.496063 6.430446 5.610236 6.036746 6.167979 8.661418 6.167979 6.430446
## [57] 6.069554 5.150919 6.003937 6.003937 5.577428 5.446194 5.413386 6.332021
## [65] 6.266404 6.003937 5.511811 6.496063 7.513124 6.988189 5.479003 2.591864
## [73] 3.149606 6.332021 6.266404 5.839895 7.086614 7.677166 6.167979 5.839895
## [81] 6.758530 NA NA NA NA NA 5.413386
Good work! Using possibly()
helps us work with problem data in a really clean and efficient way.
3.3 purrr is a walk() in the park
3.3.1 Comparing walk() vs no walk() outputs
Printing out lists with map()
shows a lot of bracketed text
in the console, which can be useful for understanding their structure,
but this information is usually not important for communicating with your end users. If you need to print, using walk()
prints out lists in a more compact and human-readable way, without all those brackets. walk()
is also great for printing out plots without printing anything to the console.
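What sets walk() apart from map() is its return value: it calls the function for its side effect and hands back its input invisibly. A small sketch with a hypothetical list named sites:

```r
library(purrr)

# Hypothetical list, used only for illustration
sites <- list("north", "west", "east")

# walk() calls print() on each element for its side effect ...
returned <- walk(sites, print)

# ... and invisibly returns its input unchanged, unlike map(),
# which would return a new list of the print() results
identical(returned, sites)
```

Because the input comes back unchanged, walk() can sit in the middle of a pipeline without altering what flows through it.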
Here, you’ll be using the people_by_film dataset, a dataset derived from sw_films that has the url of each character and the film they appear in.
Print people_by_film to the console.
# Print normally
people_by_film = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRObsvb_OQ7qeXRvkTEbWBbQcYfyebglhoxAt9cIdRzH7Exf5s-mMqSgtjkHC0qNgK4PVsku7Q0bwfS/pub?gid=0&single=true&output=csv")
people_by_film %>% head()
## url film_url
## 1 http://swapi.co/api/people/1/ http://swapi.co/api/films/6/
## 2 http://swapi.co/api/people/1/ http://swapi.co/api/films/3/
## 3 http://swapi.co/api/people/1/ http://swapi.co/api/films/2/
## 4 http://swapi.co/api/people/1/ http://swapi.co/api/films/1/
## 5 http://swapi.co/api/people/1/ http://swapi.co/api/films/7/
## 6 http://swapi.co/api/people/2/ http://swapi.co/api/films/5/
Print out people_by_film
using walk()
and print()
.
# Print with walk
walk(people_by_film, print)
## [1] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/1/"
## [3] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/1/"
## [5] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/2/"
## [7] "http://swapi.co/api/people/2/" "http://swapi.co/api/people/2/"
## [9] "http://swapi.co/api/people/2/" "http://swapi.co/api/people/2/"
## [11] "http://swapi.co/api/people/2/" "http://swapi.co/api/people/3/"
## [13] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/3/"
## [15] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/3/"
## [17] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/3/"
## [19] "http://swapi.co/api/people/4/" "http://swapi.co/api/people/4/"
## [21] "http://swapi.co/api/people/4/" "http://swapi.co/api/people/4/"
## [23] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/5/"
## [25] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/5/"
## [27] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/6/"
## [29] "http://swapi.co/api/people/6/" "http://swapi.co/api/people/6/"
## [31] "http://swapi.co/api/people/7/" "http://swapi.co/api/people/7/"
## [33] "http://swapi.co/api/people/7/" "http://swapi.co/api/people/8/"
## [35] "http://swapi.co/api/people/9/" "http://swapi.co/api/people/10/"
## [37] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/10/"
## [39] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/10/"
## [41] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/11/"
## [43] "http://swapi.co/api/people/11/" "http://swapi.co/api/people/11/"
## [45] "http://swapi.co/api/people/12/" "http://swapi.co/api/people/12/"
## [47] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/13/"
## [49] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/13/"
## [51] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/14/"
## [53] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/14/"
## [55] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/15/"
## [57] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/16/"
## [59] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
## [61] "http://swapi.co/api/people/18/" "http://swapi.co/api/people/18/"
## [63] "http://swapi.co/api/people/19/" "http://swapi.co/api/people/20/"
## [65] "http://swapi.co/api/people/20/" "http://swapi.co/api/people/20/"
## [67] "http://swapi.co/api/people/20/" "http://swapi.co/api/people/20/"
## [69] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/21/"
## [71] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/21/"
## [73] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/22/"
## [75] "http://swapi.co/api/people/22/" "http://swapi.co/api/people/22/"
## [77] "http://swapi.co/api/people/23/" "http://swapi.co/api/people/24/"
## [79] "http://swapi.co/api/people/25/" "http://swapi.co/api/people/25/"
## [81] "http://swapi.co/api/people/26/" "http://swapi.co/api/people/27/"
## [83] "http://swapi.co/api/people/27/" "http://swapi.co/api/people/28/"
## [85] "http://swapi.co/api/people/29/" "http://swapi.co/api/people/30/"
## [87] "http://swapi.co/api/people/31/" "http://swapi.co/api/people/32/"
## [89] "http://swapi.co/api/people/33/" "http://swapi.co/api/people/33/"
## [91] "http://swapi.co/api/people/33/" "http://swapi.co/api/people/34/"
## [93] "http://swapi.co/api/people/36/" "http://swapi.co/api/people/36/"
## [95] "http://swapi.co/api/people/37/" "http://swapi.co/api/people/38/"
## [97] "http://swapi.co/api/people/39/" "http://swapi.co/api/people/40/"
## [99] "http://swapi.co/api/people/40/" "http://swapi.co/api/people/41/"
## [101] "http://swapi.co/api/people/42/" "http://swapi.co/api/people/43/"
## [103] "http://swapi.co/api/people/43/" "http://swapi.co/api/people/44/"
## [105] "http://swapi.co/api/people/45/" "http://swapi.co/api/people/46/"
## [107] "http://swapi.co/api/people/46/" "http://swapi.co/api/people/46/"
## [109] "http://swapi.co/api/people/48/" "http://swapi.co/api/people/49/"
## [111] "http://swapi.co/api/people/50/" "http://swapi.co/api/people/51/"
## [113] "http://swapi.co/api/people/51/" "http://swapi.co/api/people/51/"
## [115] "http://swapi.co/api/people/52/" "http://swapi.co/api/people/52/"
## [117] "http://swapi.co/api/people/52/" "http://swapi.co/api/people/53/"
## [119] "http://swapi.co/api/people/53/" "http://swapi.co/api/people/53/"
## [121] "http://swapi.co/api/people/54/" "http://swapi.co/api/people/54/"
## [123] "http://swapi.co/api/people/55/" "http://swapi.co/api/people/55/"
## [125] "http://swapi.co/api/people/56/" "http://swapi.co/api/people/56/"
## [127] "http://swapi.co/api/people/57/" "http://swapi.co/api/people/58/"
## [129] "http://swapi.co/api/people/58/" "http://swapi.co/api/people/58/"
## [131] "http://swapi.co/api/people/59/" "http://swapi.co/api/people/59/"
## [133] "http://swapi.co/api/people/60/" "http://swapi.co/api/people/61/"
## [135] "http://swapi.co/api/people/62/" "http://swapi.co/api/people/63/"
## [137] "http://swapi.co/api/people/63/" "http://swapi.co/api/people/64/"
## [139] "http://swapi.co/api/people/64/" "http://swapi.co/api/people/65/"
## [141] "http://swapi.co/api/people/66/" "http://swapi.co/api/people/67/"
## [143] "http://swapi.co/api/people/67/" "http://swapi.co/api/people/68/"
## [145] "http://swapi.co/api/people/68/" "http://swapi.co/api/people/69/"
## [147] "http://swapi.co/api/people/70/" "http://swapi.co/api/people/71/"
## [149] "http://swapi.co/api/people/72/" "http://swapi.co/api/people/73/"
## [151] "http://swapi.co/api/people/74/" "http://swapi.co/api/people/47/"
## [153] "http://swapi.co/api/people/75/" "http://swapi.co/api/people/75/"
## [155] "http://swapi.co/api/people/76/" "http://swapi.co/api/people/77/"
## [157] "http://swapi.co/api/people/78/" "http://swapi.co/api/people/78/"
## [159] "http://swapi.co/api/people/79/" "http://swapi.co/api/people/80/"
## [161] "http://swapi.co/api/people/81/" "http://swapi.co/api/people/81/"
## [163] "http://swapi.co/api/people/82/" "http://swapi.co/api/people/82/"
## [165] "http://swapi.co/api/people/83/" "http://swapi.co/api/people/84/"
## [167] "http://swapi.co/api/people/85/" "http://swapi.co/api/people/86/"
## [169] "http://swapi.co/api/people/87/" "http://swapi.co/api/people/88/"
## [171] "http://swapi.co/api/people/35/" "http://swapi.co/api/people/35/"
## [173] "http://swapi.co/api/people/35/"
## [1] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [3] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [5] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/5/"
## [7] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [9] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [11] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [13] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [15] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [17] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/7/"
## [19] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [21] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [23] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [25] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [27] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/5/"
## [29] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/1/"
## [31] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [33] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
## [35] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [37] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [39] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [41] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [43] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [45] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/1/"
## [47] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [49] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [51] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/3/"
## [53] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [55] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/1/"
## [57] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/3/"
## [59] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/3/"
## [61] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [63] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [65] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [67] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [69] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [71] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [73] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/5/"
## [75] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [77] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [79] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [81] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/3/"
## [83] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/3/"
## [85] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [87] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/4/"
## [89] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [91] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/4/"
## [93] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [95] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [97] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [99] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [101] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [103] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [105] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/5/"
## [107] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [109] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [111] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [113] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [115] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [117] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [119] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [121] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [123] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [125] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [127] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [129] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [131] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [133] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [135] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [137] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [139] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [141] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [143] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [145] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [147] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [149] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [151] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [153] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [155] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [157] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [159] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [161] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/1/"
## [163] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [165] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/7/"
## [167] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [169] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [171] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [173] "http://swapi.co/api/films/6/"
Good work! Now you can use walk() to make your outputs cleaner and more human-readable.
3.3.2 walk() for printing cleaner list outputs
Now you will try one more use of walk()
, specifically creating plots using walk()
. In the previous exercise, you printed some lists, and you saw that printing lists is much cleaner using walk()
than using the base R way. You can also use walk()
to display multiple plots sequentially.
Here, use your map()
knowledge along with ggplot2
functions to create a graph for the first ten elements of gap_split
and then display each graph with walk()
.
Load the gap_split dataset.
# Load the gap_split data
data(gap_split)
map2() over the first 10 elements of gap_split, and the first 10 names of gap_split.
# Map over the first 10 elements of gap_split
plots <- map2(gap_split[1:10],
              names(gap_split[1:10]),
              ~ ggplot(.x, aes(year, lifeExp)) +
                geom_line() +
                labs(title = .y))
walk()
over the new plots object and supply print()
as an argument to print all plots.
# Object name, then function name
walk(plots, print)
Good work! Now you can print out multiple plots easily using walk()
.
4 Problem solving with purrr
Now that you have the building blocks, we will start tackling some more complex data problems with purrr.
4.1 Using purrr in your workflow
4.1.1 Name review
Now, you’ll quickly review how to check if a list has names, and how to
pull out a specific element from a list. Remember, you can use the names()
function to see if a list is named. There are several ways to extract a
named element from a list, but the key difference when working with
dataframes is to remember the [[double bracket]]
syntax.
- Load the gh_users data.
# Load the data
data(gh_users)
- Examine the names of gh_users.
# Check if data has names
names(gh_users)
## NULL
- Extract the names for each element of gh_users.
# Map over name element of list
map(gh_users, ~.x[["name"]])
## [[1]]
## [1] "Gábor Csárdi"
##
## [[2]]
## [1] "Jennifer (Jenny) Bryan"
##
## [[3]]
## [1] "Jeff L."
##
## [[4]]
## [1] "Julia Silge"
##
## [[5]]
## [1] "Thomas J. Leeper"
##
## [[6]]
## [1] "Maëlle Salmon"
Good work! Now that you have refreshed the basics of named lists, you can dive into the next task.
4.1.2 Setting names
Setting list names makes working with lists much easier in many scenarios; it makes the code easier to read, which is especially important when reviewing code weeks or months later.
Here you are going to work with the gh_repos
and gh_users
datasets and set their names in two different ways. The two methods will give the same result: a list with named elements.
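Before working on the real data, the naming step can be sketched on a small hypothetical list, users, that stands in for gh_users. It starts unnamed, and set_names() names it from a field pulled out of each element.

```r
library(purrr)

# Hypothetical stand-in for gh_users: an unnamed list of user records
users <- list(
  list(login = "ada", name = "Ada"),
  list(login = "bob", name = "Bob")
)

names(users)  # NULL, because the list is unnamed

# Pull out each "name" as a character vector and use it to name the list
users_named <- set_names(users, map_chr(users, "name"))
names(users_named)

# Named elements can now be extracted directly
users_named[["Ada"]][["login"]]
```

Once the list is named, elements can be looked up by name instead of by position, which is what makes later code easier to read.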
Name the elements of gh_users using the “name” element; use the map_*() function that outputs a character vector.
# Name gh_users with the names of the users
gh_users_named <- gh_users %>%
  set_names(map_chr(gh_users, "name"))
Check the structure of gh_repos to see where the owner info is stored.
# Check gh_repos structure
#str(gh_repos)
Create gh_repos_named, named by the login of the owner of each repo, using the set_names() and map_*() functions.
# Name gh_repos with the names of the repo owner
gh_repos_named <- gh_repos %>%
  map_chr(~ .[[1]]$owner$login) %>%
  set_names(gh_repos, .)
Good work! Sometimes list naming is tricky but purrr makes it simpler by easily extracting the element we want to use as the names.
4.1.3 Asking questions from a list
One of the great things about purrr
is you can easily move
from having a question about the data to an answer, with just a few
lines of code. Here you are going to use the gh_users
data to ask three questions:
- Which user joined GitHub first?
- Are all the repositories user-owned, rather than organization-owned?
- Which user has the most public repositories?
In this exercise, your map_*()
knowledge is really tested, so make sure to reflect on all the different flavors of map_*()
and how they should be used.
Name gh_users
with the “name”
element and sort the “created_at”
element to determine who joined GitHub first.
# Determine who joined github first
map_chr(gh_users, ~.x[["created_at"]]) %>%
set_names(map_chr(gh_users, "name")) %>%
sort()
## Jennifer (Jenny) Bryan Gábor Csárdi Jeff L.
## "2011-02-03T22:37:41Z" "2011-03-09T17:29:25Z" "2012-03-24T18:16:43Z"
## Thomas J. Leeper Maëlle Salmon Julia Silge
## "2013-02-07T21:07:00Z" "2014-08-05T08:10:04Z" "2015-05-19T02:51:23Z"
Output a vector that returns TRUE for each element where the “type” is “User”.
# Determine user versus organization
map_lgl(gh_users, ~.x[["type"]] == "User")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Output a named numeric vector of the number of “public_repos”
.
# Determine who has the most public repositories
map_int(gh_users, ~.x[["public_repos"]]) %>%
set_names(map_chr(gh_users, "name")) %>%
sort()
## Julia Silge Maëlle Salmon Gábor Csárdi
## 26 31 52
## Jeff L. Thomas J. Leeper Jennifer (Jenny) Bryan
## 67 99 168
Good work! Now you can use functions you already know to ask any question of your data in just a few lines of code.
4.2 Even more complex problems
4.2.1 Questions about gh_repos
You’re going to use gh_repos again, a list where each element holds the repositories of one GitHub user. Here you will use map() and map_dbl() to answer the question:
- Which repository is the largest?
The GitHub API reports repository size in kilobytes. This information could be useful to document if you are working with a list-based dataset that changes over time and need to be able to pull out information, like the largest repository, in the most recent dataset.
- map() over gh_repos.
- map_dbl() over the “size” element.
- Then map() to determine which repo is the largest.
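The nested structure behind these steps can be sketched on a small hypothetical list, repos, with one inner list of repositories per user. The names and sizes are made up for illustration.

```r
library(purrr)

# Hypothetical stand-in for gh_repos: one inner list of repos per user
repos <- list(
  list(list(name = "a", size = 10), list(name = "b", size = 250)),
  list(list(name = "c", size = 75))
)

# Inner map_dbl() collects the sizes for one user; max() keeps the largest
largest <- map_dbl(repos, ~ max(map_dbl(.x, "size")))
largest
```

This variant collapses the two steps into one call and returns a numeric vector with one value per user rather than a list.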
# Map over gh_repos to generate numeric output
map(gh_repos,
~map_dbl(.x,
~.x[["size"]])) %>%
# Grab the largest element
map(~max(.x))
## [[1]]
## [1] 39461
##
## [[2]]
## [1] 96325
##
## [[3]]
## [1] 374812
##
## [[4]]
## [1] 24070
##
## [[5]]
## [1] 558176
##
## [[6]]
## [1] 76455
Good work! You’re gaining great skills to be able to answer questions in a reproducible way with your datasets.
4.3 Graphs in purrr
4.3.1 ggplot() refresher
You’ve already been introduced to the package ggplot2
in the prerequisite for this course, but let’s do a quick refresher.
- geom_point() makes scatterplots
- geom_histogram() makes histograms
In this exercise, you are going to use a dataframe created from the gh_users
dataset, called gh_users_df
that has two columns; one for the number of public repositories a user
has and another for how many followers that user has. Each row is a
different user. Then you will make it into a scatter plot, a plot where
the data are displayed with points.
Create a scatterplot with public_repos
on the x
axis and followers
on the y
axis.
gh_users_df = tribble(~public_repos, ~followers,
                      52, 303,
                      168, 780,
                      67, 3958,
                      26, 115,
                      99, 213,
                      31, 34)
# Scatter plot of public repos and followers
ggplot(data = gh_users_df,
aes(x = public_repos, y = followers))+
geom_point()
Create a histogram of followers
by piping in gh_users_df
.
# Histogram of followers
gh_users_df %>%
  ggplot(aes(x = followers)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Good work! Isn’t making plots fun? Now let’s dive into how purrr
can help make more of them!
4.3.2 purrr and scatterplots
Since ggplot()
does not accept lists as an input, it can be paired up with purrr
to go from a list to a dataframe to a ggplot()
graph in just a few lines of code.
You will continue to work with the gh_users
data for this exercise. You will use a map_*()
function to pull out a few of the named elements and transform them
into the correct datatype. Then create a scatterplot that compares the
user’s number of followers to the user’s number of public repositories.
- map() over gh_users, use the map_*() function that creates a dataframe, with four columns, named “login”, “name”, “followers” and “public_repos”.
- Pipe that dataframe into a scatterplot, where the x axis is followers and y is public_repos.
# Create a dataframe with four columns
map_df(gh_users, `[`,
c("login","name","followers","public_repos")) %>%
# Plot followers by public_repos
ggplot(.,
aes(x = followers, y = public_repos)) +
# Create scatter plots
geom_point()
Good work! Now you can go from list to plot using a tidy workflow!
4.3.3 purrr and histograms
Now you’re going to put together everything you’ve learned, starting
with two different lists, which will be turned into a faceted histogram.
You’re going to work again with the Star Wars data from the sw_films and sw_people datasets to answer a question:
- What is the distribution of heights of characters in each of the Star Wars films?
Different movies take place on different sets of planets, so you might
expect to see different distributions of heights from the characters.
Your first task is to transform the two datasets into dataframes since ggplot()
requires a dataframe input. Then you will join them together, and plot
the result, a histogram with a different facet, or subplot, for each
film.
“title”
of each film, and the “characters”
from each film in the sw_films
dataset.
# Turn data into correct dataframe format
film_by_character <- tibble(filmtitle = map_chr(sw_films, "title")) %>%
  mutate(filmtitle, characters = map(sw_films, "characters")) %>%
  unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(characters)`
“height”
, “mass”
, “name”
, and “url”
elements from sw_people
.
# Pull out elements from sw_people
sw_characters <- map_df(sw_people, `[`, c("height","mass","name","url"))
“characters
” and “url
” keys.
# Join our two new objects
character_data <- inner_join(film_by_character, sw_characters, by = c("characters" = "url")) %>%
  # Make sure the columns are numbers
  mutate(height = as.numeric(height), mass = as.numeric(mass))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
ggplot()
histogram with x = height
, faceted by filmtitle
.
# Plot the heights, faceted by film title
ggplot(character_data, aes(x = height)) +
geom_histogram(stat = "count") +
facet_wrap(~ filmtitle)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 6 rows containing non-finite values (stat_count).
Good work! Now you’ve learned all the basics of how you can use purrr to make tasks that require iteration and working with lists more manageable and human-readable!
Course Description
Have you ever wondered what the purrr description (“A functional programming toolkit for R”) refers to? Then you’ve come to the right place! This course will walk you through the functional programming part of purrr. In other words, you will learn how to take full advantage of the flexibility offered by the .f in map(.x, .f) to iterate over lists, vectors, and data.frames with robust, clean, and easy-to-maintain code. During this course, you will learn how to write your own mappers (or lambda functions) and how to use predicates and adverbs. Finally, this new knowledge will be applied to a use case built on a simple nested list, so that you can see how to extract, keep, or discard elements, how to compose functions to manipulate and parse results from this list, how to integrate a purrr workflow inside other functions, and how to avoid copying and pasting with purrr functional tools.
5 Programming with purrr
Do lambda functions, mappers, and predicates sound scary to you? Fear no more! After refreshing your purrr memory, we will dive into functional programming 101, discover anonymous functions and predicates, and see how we can use them to clean and explore data.
5.1 purrr basics - a refresher
5.1.1 Refreshing your purrr memory
Let’s pretend you’re a data analyst working for a web agency. The web-design team has been running a weeklong A/B test that compares the performance of two design proposals for a website, and you’re now in charge of analyzing the results.
The team measured the number of visits to the Contact page to determine the design’s impact on the number of people contacting the company. These designs were presented to 2/3 of visitors.
visit_a
contains the results from campaign A and visit_b
the results of campaign B. Both are expressed as an average hourly
number of visits. All the other stats you have are expressed as visits
per day, so you need to convert these two. Then, you’ll extract the mean
of each vector.
Note that these are new data, not the one from the video.
to_day()
function, which multiplies x
by 24.
# Create the to_day function
to_day <- function(x) {
  x * 24
}
visit_a
and visit_b
.
visit_a <- c(117, 147, 131, 73, 81, 134, 121)
visit_b <- c(180, 193, 116, 166, 131, 153, 146)
visit_c <- c(57, 110, 68, 72, 87, 141, 67)
# Create a list containing both vectors: all_visits
all_visits <- list(visit_a, visit_b)
map()
and the to_day()
function.
# Convert to daily number of visits: all_visits_day
all_visits_day <- map(all_visits, to_day)
mean()
function on the results.
# Map the mean() function and output a numeric vector
map_dbl(all_visits_day, mean)
## [1] 2756.571 3720.000
Well done! You’re mastering the basic syntax of iteration with purrr with the map()
and map_dbl()
functions. Let’s refresh your memory a little more!
5.1.2 Another purrr refresher
You just received visit_c, the number of visits on the website during the same week, but with the old design, which was shown to 1/3 of website visitors. You now want to compare visit_c with the two previous designs, visit_a and visit_b, to know which one led to more visits to the Contact page.
Again, you’ll need to turn all the visitor lists to the daily number of visits.
You’ve been asked to provide two insights:
- A plot for each element
- The total number of visits for each day, regardless of design
You’ll test out both map() and walk() for plotting. Both produce the “side effects,” that is to say, the changes in the environment (drawing plots, downloading a file, changing the working directory…), but walk() won’t print anything to the console.
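The difference can be seen without any plotting. The sketch below uses a hypothetical values list and message() as the side effect: map() collects the return values into a new list, while walk() invisibly hands back its input.

```r
library(purrr)

values <- list(1, 2, 3)

# map() collects the return value of each call into a new list
doubled <- map(values, ~ .x * 2)

# walk() runs the function for its side effect (here, a message)
# and invisibly returns its input unchanged
walked <- walk(values, ~ message("Processing ", .x))

identical(walked, values)
```

Since walk() returns its input, it can sit in the middle of a pipeline without changing the data that flows through it.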
- Create a list containing the three vectors and turn these list elements to the daily number of visits.
# Create all_tests list and modify with to_day() function
all_tests <- list(visit_a, visit_b, visit_c)
all_tests_day <- map(all_tests, to_day)
- Create three bar plots of all_tests_day with one call using map().
# Plot all_tests_day with map
map(all_tests_day, barplot)
## [[1]]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
##
## [[2]]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
##
## [[3]]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
- Create three plots with one call, without anything printed to the console.
# Plot all_tests_day
walk(all_tests_day, barplot)
- Get the sum of all_tests_day as a list, then check you’ve got a numeric output by printing the class of the object.
# Get the sum, of the all_tests_day list, element by element, and check its class
sum_all <- pmap_dbl(all_tests_day, sum)
class(sum_all)
## [1] "numeric"
Congratulations! We are first using map()
because we want to apply the function to each element of the list. Then, we are using pmap_dbl()
because we need to take sub-element one by one. So, now that we have seen the basics of iteration with purrr
, let’s dive into programming!
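To make the contrast between the two functions concrete, here is a small sketch with made-up numbers (design_1 and design_2 are hypothetical, not the course data). map_dbl() treats each vector as a whole, whereas pmap_dbl() walks through the vectors in parallel, position by position.

```r
library(purrr)

# Hypothetical data: two designs measured over three days
design_1 <- c(1, 2, 3)
design_2 <- c(10, 20, 30)
both <- list(design_1, design_2)

# map_dbl() works on one vector at a time: total per design
per_design <- map_dbl(both, sum)   # 6 60

# pmap_dbl() takes the first elements together, then the second, and so on:
# total per day across designs
per_day <- pmap_dbl(both, sum)     # 11 22 33
```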
5.2 Introduction to mappers
5.2.1 Creating lambda functions
Do you recall the three vectors visit_a
, visit_b
and visit_c
from the A/B test from the last exercise? They are still available in your workspace.
Remember that these vectors contain the hourly visit rate by day. Each of these vectors corresponds to one design of the website, randomly served to the visitors. We are going to turn these vectors into a daily number of visits, but this time, we’ll use a mapper.
Using a mapper allows you to write reusable code: you will potentially be asked to redo this task, so if you have an already existing mapper, you will be able to reuse this object, instead of copying and pasting the same code again and again.
visit_a
.
# Turn visit_a into daily number using an anonymous function
map(visit_a, function(x) {
  x * 24
})
## [[1]]
## [1] 2808
##
## [[2]]
## [1] 3528
##
## [[3]]
## [1] 3144
##
## [[4]]
## [1] 1752
##
## [[5]]
## [1] 1944
##
## [[6]]
## [1] 3216
##
## [[7]]
## [1] 2904
# Turn visit_a into daily number of visits by using a mapper
map(visit_a, ~ .x * 24)
## [[1]]
## [1] 2808
##
## [[2]]
## [1] 3528
##
## [[3]]
## [1] 3144
##
## [[4]]
## [1] 1752
##
## [[5]]
## [1] 1944
##
## [[6]]
## [1] 3216
##
## [[7]]
## [1] 2904
to_day
.
# Create a mapper object called to_day
to_day <- as_mapper(~ .x * 24)
to_day
on the three vectors (make three calls).
# Use it on the three vectors
map(visit_a, to_day)
## [[1]]
## [1] 2808
##
## [[2]]
## [1] 3528
##
## [[3]]
## [1] 3144
##
## [[4]]
## [1] 1752
##
## [[5]]
## [1] 1944
##
## [[6]]
## [1] 3216
##
## [[7]]
## [1] 2904
map(visit_b, to_day)
## [[1]]
## [1] 4320
##
## [[2]]
## [1] 4632
##
## [[3]]
## [1] 2784
##
## [[4]]
## [1] 3984
##
## [[5]]
## [1] 3144
##
## [[6]]
## [1] 3672
##
## [[7]]
## [1] 3504
map(visit_c, to_day)
## [[1]]
## [1] 1368
##
## [[2]]
## [1] 2640
##
## [[3]]
## [1] 1632
##
## [[4]]
## [1] 1728
##
## [[5]]
## [1] 2088
##
## [[6]]
## [1] 3384
##
## [[7]]
## [1] 1608
Well played! You now know a little bit more about lambda functions and mappers, and you’ve used them to transform your dataset. Let’s try again in a new exercise!
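As a compact recap, the three styles seen in this exercise are interchangeable. The sketch below runs them on a hypothetical hourly vector and uses map_dbl() (rather than map()) so that the results are easy to compare.

```r
library(purrr)

hourly <- c(10, 20)  # hypothetical hourly visit rates

# 1. A classic anonymous function
with_function <- map_dbl(hourly, function(x) x * 24)

# 2. A one-sided formula, where .x stands for the current element
with_formula <- map_dbl(hourly, ~ .x * 24)

# 3. A stored mapper object that can be reused in later calls
to_day_mapper <- as_mapper(~ .x * 24)
with_mapper <- map_dbl(hourly, to_day_mapper)
```

All three return 240 and 480; only the third leaves behind an object you can reuse elsewhere.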
5.2.2 Lambda functions
We are still working with the results of a weeklong A/B test on a
website. The three vectors containing the number of visits for each
design (visit_a
, visit_b
and visit_c
) are available in your workspace.
One of your colleagues has asked you to transfer him the results, but he
wants them to be rounded to the nearest ten. To do this, you will need
to call the round()
function this way:
Rounding to a negative number of digits means rounding to a power of ten, so for example round(x, digits = -2)
rounds to the nearest hundred
Definition taken from R documentation: see ?round
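A quick check of how negative digits behave, using two made-up values:

```r
visits <- c(117, 1234)  # hypothetical values

# digits = -1 rounds to the nearest ten
nearest_ten <- round(visits, digits = -1)      # 120 1230

# digits = -2 rounds to the nearest hundred
nearest_hundred <- round(visits, digits = -2)  # 100 1200
```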
Make sure to use the right map_*
for each call.
visit_a
to the nearest ten with a mapper.
# Round visit_a to the nearest ten with a mapper
map_dbl(visit_a, ~ round(.x, -1))
## [1] 120 150 130 70 80 130 120
to_ten
, that rounds to the nearest ten.
# Create to_ten, a mapper that rounds to the nearest ten
to_ten <- as_mapper(~ round(.x, -1))
to_ten
to visit_b
.
# Map to_ten on visit_b
map_dbl(visit_b, to_ten)
## [1] 180 190 120 170 130 150 150
to_ten
to visit_c
.
# Map to_ten on visit_c
map_dbl(visit_c, to_ten)
## [1] 60 110 70 70 90 140 70
Purrrfect! Are you starting to like mappers ;)? In this exercise, you’ve seen how to build a reusable mapper. Using reusable elements (like mappers here) lets you write code that is easier to use and to maintain in the long run.
5.3 Using mappers to clean data
5.3.1 Clean up your data with keep()
Since the beginning of this course, we have been using the results of a weeklong A/B test.
We have put these results in a list called all_visits
. This list contains visit_a
, visit_b
, and visit_c
. These vectors are unnamed. They all contain seven numbers, one for each day of the week.
The first question we want to ask is: which days reached more than 100 visits an hour on average? We will use the keep()
function. But the answer would not be readable with an unnamed vector:
you would have the numbers, but you would not know to which day these
numbers correspond.
The good news is: you can use the set_names()
function to solve this issue. This is what we’ll do in this chapter: first, use keep()
on unnamed vectors, then on named ones.
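The readability problem is easy to reproduce on a tiny example. In the sketch below, hourly is a hypothetical three-day vector; once names are attached with set_names(), keep() carries them along with the values it retains.

```r
library(purrr)

hourly <- c(90, 120, 150)  # hypothetical hourly averages for three days

# Without names, the kept values cannot be traced back to a day
keep(hourly, ~ .x > 100)

# With names, each kept value stays labelled with its day
named_hourly <- set_names(hourly, c("mon", "tue", "wed"))
busy_days <- keep(named_hourly, ~ .x > 100)
names(busy_days)
```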
.x
is more than 100. You’ll use it twice.
# Create a mapper that tests if .x is more than 100
is_more_than_hundred <- as_mapper(~ .x > 100)
keep()
, and map it on the unnamed list all_visits
. As the result is unnamed, you don’t know which days you have kept.
# Use this mapper with keep() on the all_visits object
map(all_visits, ~ keep(.x, is_more_than_hundred))
## [[1]]
## [1] 117 147 131 134 121
##
## [[2]]
## [1] 180 193 116 166 131 153 146
map()
and the set_names()
functions, using the vector of names we have provided.
# Use the day vector to set names to all_list
day <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
full_visits_named <- map(all_visits, ~ set_names(.x, day))
# Use this mapper with keep()
map(full_visits_named, ~ keep(.x, is_more_than_hundred))
## [[1]]
## mon tue wed sat sun
## 117 147 131 134 121
##
## [[2]]
## mon tue wed thu fri sat sun
## 180 193 116 166 131 153 146
Great! In this exercise, you’ve learned how to name vectors, and how to construct a reusable mapper to answer questions about your data.
5.3.2 Split up with keep() and discard()
We want to split our results into two groups: the days over 100, and the days under 100. We’ll combine keep()
and discard()
to do so.
Why two functions? Couldn’t we use one function? Couldn’t we create a mapper called is_less_than_hundred
?
We could, but that would be more error-prone: it’s easier to switch from keep() to discard() than copying and pasting. By combining both functions, we only need one mapper. That means that if we want to change the threshold, we’ll only need to do it once, not twice, as we would have to do if we had two mappers.
This is a rule you should endeavor to apply when coding: write code so that if you need to change one thing, you will have to change it just once.
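A minimal sketch of that rule, assuming a hypothetical over_limit mapper and made-up values: the same predicate drives both keep() and discard(), so the two groups always partition the data.

```r
library(purrr)

# One predicate, defined once
over_limit <- as_mapper(~ .x > 100)

hourly <- c(mon = 90, tue = 120, wed = 150, thu = 80)  # hypothetical values

above     <- keep(hourly, over_limit)     # elements where the predicate is TRUE
not_above <- discard(hourly, over_limit)  # elements where it is FALSE

# Every element lands in exactly one of the two groups
length(above) + length(not_above) == length(hourly)
```

Note that discard() returns everything that is not strictly over 100, so a value of exactly 100 would end up in the second group.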
all_visits
is still available in your workspace.
set_names()
function on all_visits
to add the name of the days: all_visits_named
.
# Set the name of each subvector
day <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
all_visits_named <- map(all_visits, ~ set_names(.x, day))
threshold
that will test if .x
is over 100.
# Create a mapper that will test if .x is over 100
threshold <- as_mapper(~ .x > 100)
group_over
by keeping the elements that are over 100.
# Run this mapper on the all_visits_named object
group_over <- map(all_visits_named, ~ keep(.x, threshold))
group_under
by discarding the elements that are over 100.
# Run this mapper on the all_visits_named object
group_under <- map(all_visits_named, ~ discard(.x, threshold))
Well done! As you can see in this code, if I want to change the threshold, I have to change it once. This is an important feature of good code: do not write code in a way that if you need to change a parameter, you’ll have to change it several times.
5.4 Predicates
5.4.1 What is a predicate?
A predicate function is “a function that either returns TRUE or FALSE,” while a predicate functional “takes a vector and a predicate function and do something useful.”
In other words, the predicate functionals take in .x
, which is a vector, a dataframe, or a list, and test the predicate on every element of .x
. For example, you can test if every element is numeric with the is.numeric()
predicate from R-Base, or if the mean of some elements is under 5 with this mapper: ~mean(.x) < 5
.
Which of these functions is NOT a predicate?
-
is.character()
-
function(x) x < 5
-
~ .x * 100
-
~ .x < 5
5.4.2 Exploring data with predicates
We will continue our exploration of A/B test data. Your manager is not interested in which days reached the threshold; he wants to know whether every day reached the threshold or whether only some days did. We’ll use purrr predicates to answer these questions.
You have received several thresholds and decided to write a script that will start with this threshold definition, and answer, for each design, if all the days have reached the threshold, and if not, if some did.
The results from this A/B test are in the all_visits
list.
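Before applying this to all_visits, here is a small sketch of how every() and some() behave on a single vector. The names limit, over_limit and week are hypothetical and chosen only for this illustration.

```r
library(purrr)

limit <- 100
over_limit <- as_mapper(~ .x > limit)

week <- c(120, 150, 90)  # hypothetical hourly averages

# every(): TRUE only if the predicate holds for all elements
all_over <- every(week, over_limit)  # FALSE, because 90 is not over 100

# some(): TRUE as soon as the predicate holds for at least one element
any_over <- some(week, over_limit)   # TRUE
```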
threshold
, that contains the number 160.
# Create a threshold variable that contains 160
threshold <- 160
.x
is over threshold
.
# Create a mapper that tests if .x is over threshold
over_threshold <- as_mapper(~ .x > threshold)
map()
and every()
to test if all elements are over the threshold.
# Are all elements over the defined threshold?
map(all_visits, ~ every(.x, over_threshold))
## [[1]]
## [1] FALSE
##
## [[2]]
## [1] FALSE
map()
and some()
to test if some elements are over the threshold.
# Are some elements over the defined threshold?
map(all_visits, ~ some(.x, over_threshold))
## [[1]]
## [1] FALSE
##
## [[2]]
## [1] TRUE
Well done! You’ve completed the first chapter of the course. We’ve
played a lot with lists in this first chapter. You may think you won’t
need this purrr
knowledge as you’re only dealing with a
data frame. But good news: as data.frames are lists of same-length
vectors; you can apply all these purrr
methods to a data.frame. We’ll also see in the next chapter how to use purrr
inside data.frames with list-columns. Starting to feel addicted to purrr
? Rendez-vous in the next chapter for more magic!
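To illustrate the point about data frames being lists of columns, the sketch below maps over the built-in iris dataset: each column is treated as one list element.

```r
library(purrr)

# Each column of a data.frame is one element of the underlying list
numeric_cols <- map_lgl(iris, is.numeric)

# Use the logical vector to select columns, then summarise each one
col_means <- map_dbl(iris[numeric_cols], mean)

numeric_cols
col_means
```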
6 FP: from theory to practice
Ready to go deeper with functional programming and purrr? In this chapter, we’ll discover the concept of functional programming, explore error handling with safely() and possibly(), and introduce the function compact() for cleaning your code.
6.1 Functional programming in R
6.1.1 Everything that happens is a function call
When you are using R, every computation happens because of a call to a function.
In other words, every operation made on an object is linked to a
function. And you’ve been using functions from the very first day you
started R: <-
is a function, as is [
.
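One way to convince yourself of this is to call operators in prefix form, by wrapping their names in backticks. This is the same trick that lets `[` be passed to a map function as an ordinary function.

```r
# Infix and subsetting operators can be called like any other function
# by wrapping their names in backticks
sum_infix  <- 1 + 2
sum_prefix <- `+`(1, 2)

letters_vec <- c("a", "b", "c")
second_bracket <- letters_vec[2]
second_prefix  <- `[`(letters_vec, 2)

identical(sum_infix, sum_prefix)
identical(second_bracket, second_prefix)
```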
What do you think would be the output of this code?
class(`$`)
-
“object”
-
“function”
-
“character”
-
“operator”
6.1.2 Identifying pure functions
A pure function satisfies two properties:
- Its output only depends on its inputs: when you input a value, the output is always the same.
- It has no side-effect, that is to say, no effect outside the function.
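The two properties can be sketched with a pair of toy functions (add_one and count_up are hypothetical names used only here). The first is pure; the second breaks both rules because it reads and modifies a variable that lives outside the function.

```r
# Pure: the result depends only on the argument, and nothing outside changes
add_one <- function(x) x + 1

# Impure: reads and modifies a variable defined outside the function
counter <- 0
count_up <- function() {
  counter <<- counter + 1
  counter
}

add_one(1) == add_one(1)   # same input, same output, every time

first_call  <- count_up()  # 1
second_call <- count_up()  # 2: same (empty) input, different output
```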
A lot of functions in R are not pure, yet they are vital for a day to day use of R: when doing an analysis, you need to download files, create a plot, save results…
When programming, you should aim at making your functions either as pure as possible or as impure as possible (for example, a function that downloads a file should only download this file). But for that, you first need to be able to recognize a pure function from an impure one.
This is what we’ll do in this exercise: run functions which are either pure or impure, and see what their outputs are.
Run Sys.time()
, then Sys.sleep(1)
, then Sys.time()
again, to see how two calls to the same function can lead to different results.
# Launch Sys.time(), Sys.sleep(1), & Sys.time()
Sys.time()
## [1] "2022-02-18 14:07:38 +07"
Sys.sleep(1)
Sys.time()
## [1] "2022-02-18 14:07:39 +07"
Run nrow(iris)
, then Sys.sleep(1)
, then nrow(iris)
again, to see how these two calls return the same thing, regardless of time.
# Launch nrow(iris), Sys.sleep(1), & nrow(iris)
nrow(iris)
## [1] 150
Sys.sleep(1)
nrow(iris)
## [1] 150
Run ls()
, which lists the objects in the environment. Create a new object called this
, which contains 12, then run ls()
again.
# Launch ls(), create an object, then rerun the ls() function
ls()
## [1] "a" "all_files" "all_files_purrr"
## [4] "all_tests" "all_tests_day" "all_visits"
## [7] "all_visits_day" "all_visits_named" "character_data"
## [10] "day" "files" "film_by_character"
## [13] "full_visits_named" "gap_split" "gh_repos_named"
## [16] "gh_users" "gh_users_df" "gh_users_named"
## [19] "group_over" "group_under" "height_cm"
## [22] "height_ft" "i" "is_more_than_hundred"
## [25] "list_of_df" "list_of_files_map2" "list_of_files_pmap"
## [28] "means" "means2" "numlist"
## [31] "over_threshold" "people_by_film" "plots"
## [34] "pmapinputs" "sigma" "sigma2"
## [37] "sites" "sum_all" "sw_characters"
## [40] "sw_films_named" "sw_people" "threshold"
## [43] "to_day" "to_ten" "visit_a"
## [46] "visit_b" "visit_c" "wesanderson"
this <- 12
ls()
## [1] "a" "all_files" "all_files_purrr"
## [4] "all_tests" "all_tests_day" "all_visits"
## [7] "all_visits_day" "all_visits_named" "character_data"
## [10] "day" "files" "film_by_character"
## [13] "full_visits_named" "gap_split" "gh_repos_named"
## [16] "gh_users" "gh_users_df" "gh_users_named"
## [19] "group_over" "group_under" "height_cm"
## [22] "height_ft" "i" "is_more_than_hundred"
## [25] "list_of_df" "list_of_files_map2" "list_of_files_pmap"
## [28] "means" "means2" "numlist"
## [31] "over_threshold" "people_by_film" "plots"
## [34] "pmapinputs" "sigma" "sigma2"
## [37] "sites" "sum_all" "sw_characters"
## [40] "sw_films_named" "sw_people" "this"
## [43] "threshold" "to_day" "to_ten"
## [46] "visit_a" "visit_b" "visit_c"
## [49] "wesanderson"
Run plot(iris)
, which creates a basic plot of the iris
dataset. See how nothing is printed to the console, and only a side-effect is produced.
# Create a plot of the iris dataset
plot(iris)
Sys.time() is an extremely impure function, as it will return a different output depending on when you are running it. So is ls(), which depends on what is in your environment. nrow() is pure, as the output only depends on the object you’re using as an input, and it has no side effect. Other examples of impure functions include read.csv(), which depends on an external source (if the file changes, the output will change), and plot(), which is by definition called for its side effects.
6.2 Tools for FP in purrr
6.2.1 Safe iterations
As in the previous chapter, let’s pretend you are a data analyst working for a web agency. This time, you’ve been asked to do some web scraping.
(Note: don’t be afraid if you don’t know how to do web scraping, we’ll start simple, and all the functions will be explained).
You have received a list of URLs, but you suspect that some are not real
addresses. The first thing you will do is test if you can connect to
these URLs. For this, we’ll use a simple function from the readr
package: read_lines()
, that we will put inside a safely()
. When given a URL, read_lines()
reads the HTML, or returns an error if the URL is not reachable.
The urls vector is available in your workspace. Print it in the console if you want to know what is inside.
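To see what a safely() wrapper returns without touching the network, here is a sketch that wraps base R’s log() instead of read_lines(). The name safe_log is hypothetical.

```r
library(purrr)

# Wrap log() so that it never stops with an error
safe_log <- safely(log)

ok     <- safe_log(10)   # succeeds
failed <- safe_log("a")  # log() would normally throw an error here

# Each call returns a list with two elements, one of which is NULL
ok$result      # the value of log(10)
ok$error       # NULL
failed$result  # NULL
failed$error   # the captured error object
```

The same two-element structure is what you get back for each URL when the wrapped function is read_lines().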
read_lines()
function.
# Create a safe version of read_lines()
safe_read <- safely(read_lines)
urls
.
urls <- c("https://thinkr.fr",
          "https://colinfay.me",
          "https://en.wikipedia.org",
          "http://cran.r-project.org/")
# Map it on the urls vector
res <- map(urls, safe_read)
set_names()
function.
# Set the name of the results to `urls`
named_res <- set_names(res, urls)
“error”
element of each sublist.
# Extract only the "error" part of each sublist
map(named_res, "error")
## $`https://thinkr.fr`
## NULL
##
## $`https://colinfay.me`
## NULL
##
## $`https://en.wikipedia.org`
## NULL
##
## $`http://cran.r-project.org/`
## NULL
Purrrfect. Thanks to safely(), you were able to iterate over the list of URLs, even if some return errors.
6.2.2 Create a function
We’ve seen how we can use safely() to identify non-reachable urls in the previous exercise: we wrote a little process that called a safe version of read_lines(), and returned a list of $error elements.
In this exercise, we’ll try another approach, as we won’t focus on errors only. Instead of mapping a safe function and extracting the “error” elements from the results, we will write a helper function that will immediately discard() the NULL elements of the output of safe_read().
This way, instead of extracting the $error or $result part of the output, we’ll be able to know if the elements are reachable (the content is returned in $result) or not (the error is returned in $error).
The urls
vector has been provided for you.
read_lines()
.
# Create a safe version of read_lines()
safe_read <- safely(read_lines)
safe_read_discard()
that will run the safe version of read_lines()
and discard()
the NULL
elements.
# Code a function that discard() the NULL from safe_read()
safe_read_discard <- function(url){
  safe_read(url) %>%
    discard(is.null)
}
# Map this function on the url list
res <- map(urls, safe_read_discard)
Nice! You now have a simple function that can tell you if a URL is reachable, or if it returns an error.
6.3 Using possibly()
6.3.1 A possibly() version of read_lines()
We are still working with the series of URLs you were given to scrape. We are trying several methods to identify URLs that can’t be accessed. Why are we doing that? Because the first step of web scraping is analyzing if you can access the URL or not. This is what the code we are writing will be useful for.
In the previous exercise, we wrapped the read_lines()
function inside a safely()
function. In this exercise, we will use the possibly()
function.
In web terminology, a 404 indicates that a web page is not available. This number will be used as the otherwise
argument.
Also, as read_lines() returns a vector of length n when reading a webpage, we’ll collapse these elements into a single string using the paste() function.
The urls
vector has been provided for you.
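As an offline illustration of possibly(), the sketch below again wraps log() rather than read_lines(), with NA_real_ as a hypothetical fallback value. Unlike safely(), the wrapper returns a plain value, so it works directly with a typed map_*() function.

```r
library(purrr)

# If log() fails, return NA instead of stopping
possible_log <- possibly(log, otherwise = NA_real_)

# The second element would normally raise an error
logs <- map_dbl(list(1, "a", 100), possible_log)

logs
```

The iteration completes, and the failing element is easy to spot because it carries the fallback value.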
read_lines()
function in a possibly()
call that would otherwise return 404.
# Create a possibly() version of read_lines()
possible_read <- possibly(read_lines, otherwise = 404)
set_names()
# Map this function on urls, pipe it into set_names()
res <- map(urls, possible_read) %>% set_names(urls)
paste()
function, with the collapse
argument set to ” “
.
# Paste each element of the list
res_pasted <- map(res, paste, collapse = " ")
# Keep only the elements which are equal to 404
keep(res_pasted, ~ .x == 404)
## named list()
Well done! We now have explored another way to detect which urls are not available.
6.3.2 Everything in one call
In order to make this code even more reproducible, we are going to create a function that does it in one call. We have already provided you a skeleton for this function, now it’s your turn to complete it!
In the previous exercises, we have written the process in several steps. Now, we want this to be done in just one call: we’ll write a function that takes a list of URLs, removes the elements that are not reachable, and returns the names of the ones that are.
Once you have written this function, you could save it, and reuse it whenever you need to clean a list of URLs. And maybe put it into a package ;)
The urls
list from the previous exercise is available in your workspace.
- Create, inside the map() call, a possibly() version of read_lines() that will otherwise return a 404.
- Set the names of the output.
- Use the paste() function with the collapse argument set to ” “ to turn each sublist into a character vector.
- Remove the elements which are equal to 404.
url_tester <- function(url_list){
  url_list %>%
    # Map a version of read_lines() that otherwise returns 404
    map( possibly(read_lines, otherwise = 404) ) %>%
    # Set the names of the result (use the argument, not the global urls)
    set_names( url_list ) %>%
    # paste() and collapse each element
    map(paste, collapse = " ") %>%
    # Remove the 404
    discard(~ .x == 404) %>%
    names() # Will return the names of the good ones
}
# Try this function on the urls object
url_tester(urls)
## [1] "https://thinkr.fr" "https://colinfay.me"
## [3] "https://en.wikipedia.org" "http://cran.r-project.org/"
Perfect! If you have a process that you tend to repeat, it’s better to write a function to do it.
6.4 Handling adverb results
6.4.1 Purrrfecting our function
We are still perfecting our function to detect if a list of URLs contains elements that are not available.
Let’s review what we have coded so far:
- An error extractor, by combining safely() and map(.x, “error”).
- A “non-null” extractor, by combining safely() and discard(.x, is.null).
- A 404 generator, by using possibly(.x, otherwise = 404), which was turned into a function.
We’ll change the behavior of this function a bit: you now want to be able to choose between returning either the results or the errors.
This will allow you to answer two questions with just one function: which are the unreachable URLs, and which are the reachable ones? To do this, you’ll add a parameter called “type” inside this function.
The urls
vector and safe_read()
are available in your workspace.
Complete the function definition.
- Map safe_read() to the list of URLs.
- Set the names of the result to the list of URLs.
- Transpose the result into a list of $result and $error.
- Use pluck() to extract the type element.
# Complete the function definition
url_tester <- function(url_list, type = c("result", "error")) {
  type <- match.arg(type)
  url_list %>%
    # Apply safe_read to each URL
    map(safe_read) %>%
    # Set the names to the URLs
    set_names(url_list) %>%
    # Transpose
    transpose() %>%
    # Pluck the type element
    pluck(type)
}
# Try this function on the urls object
url_tester(urls, type = "error")
## $`https://thinkr.fr`
## NULL
##
## $`https://colinfay.me`
## NULL
##
## $`https://en.wikipedia.org`
## NULL
##
## $`http://cran.r-project.org/`
## NULL
By combining safely()
and transpose()
, you’ve written a flexible function: here you can focus either on the results or on the errors.
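What transpose() does to the output of a safe function is easier to see on a small offline example. The sketch below uses a hypothetical safe_log wrapper around log() in place of safe_read().

```r
library(purrr)

safe_log <- safely(log)

# One (result, error) pair per input
res <- map(list(a = 10, b = "oops"), safe_log)

# Turn the list of pairs into a pair of lists
flipped <- transpose(res)

names(flipped)    # "result" "error"
flipped$result$a  # the value of log(10)
flipped$result$b  # NULL, because the call failed
flipped$error$b   # the captured error object
```

After transposing, pluck(flipped, "result") or pluck(flipped, "error") selects one side for all inputs at once, which is what the type argument does in url_tester().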
6.4.2 Extracting status codes with GET()
For this last exercise, we’ll switch from the read_lines()
function to the GET()
function from httr
.
We’ll first create a possibly()
version of GET()
,
in order to test if some of the URLs you’ve got return an error. If you
can access the URL, a connection object will be returned. In it, you’ll
find a “status_code”
element.
Don’t focus on the results, just remember that if a GET()
function returns an error, it’s because the URL is not available. The
status code number we are returning can appear a bit like web jargon,
but we’ll talk about it with more depth in the next chapter. Just
remember, for now, that 200 means everything went as expected.
The urls vector is available in your workspace; purrr and httr have been loaded for you.
- Create a version of GET() that would return NULL in case of error.
- Set the names of the results.
- Remove the NULL.
- Extract the “status_code” of each element.
library(httr)
url_tester <- function(url_list){
  url_list %>%
    # Create a possibly() version of GET() that would otherwise return NULL
    map( possibly(GET, NULL) ) %>%
    # Set the names of the result (use the argument, not the global urls)
    set_names( url_list ) %>%
    # Remove the NULL
    compact() %>%
    # Extract all the "status_code" elements
    map("status_code")
}
# Try this function on the urls object
url_tester(urls)
## $`https://thinkr.fr`
## [1] 200
##
## $`https://colinfay.me`
## [1] 200
##
## $`https://en.wikipedia.org`
## [1] 200
##
## $`http://cran.r-project.org/`
## [1] 200
Great! We have seen in this chapter how to write custom functions which can help you when doing data analysis: for example, it’s crucial when you are doing web scraping to ensure that the urls you want to scrape are reachable. Now you know how to do this ;)
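For reference, here is compact() on its own, applied to a hypothetical responses list in which one element is NULL (as it would be after a failed possibly(GET, NULL) call).

```r
library(purrr)

# Hypothetical results: the second request failed and returned NULL
responses <- list(a = 200, b = NULL, c = 404)

# compact() drops the NULL (and other empty) elements, keeping the names
cleaned <- compact(responses)

names(cleaned)
```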
7 Better code with purrr
In this chapter, we’ll use purrr to write code that is clearer, cleaner, and easier to maintain. We’ll learn how to write clean functions with compose() and negate(). We’ll also use partial() to compose functions by “prefilling” arguments from existing functions. Lastly, we’ll introduce list-columns, which are a convenient data structure that helps us write clean code using the Tidyverse.
7.1 Why cleaner code?
7.1.1 How to write compose()
When you use compose()
, the functions are passed from right
to left — that is to say in the same order as the one you would use in a
nested call in base R: the first function to be executed is the
function on the right.
In other words, if you are used to the pipe, the order is the opposite one:
``` r
# With the pipe
1:28 %>% mean() %>% round()
# In base R
round(mean(1:28))
# With compose
rounded_mean <- compose(round, mean)
rounded_mean(1:28)
```
So, what’s the correct way to write a function that will count the number of NA
?
-
compose(is.na, sum)
-
compose(sum, is.na)
-
compose(is.na(), sum())
-
compose(sum(), is.na())
7.1.2 Back to the office
You are still working as a data analyst for a web agency, and you’ve been asked to do web scraping. You have been given a list of URLs to analyze, an analysis you’ve already started in the previous chapter.
You expect this task to be recurrent: no doubt you’ll be asked to do it again in a few weeks. In order to make your future work easier, you’ve decided to try and write clean code today, so that it will be easier to come back to it later.
We’ll start by combining the two functions from httr we’ve seen in the previous chapter: GET(), for retrieving the webpage, and status_code(), to extract the status code, in order to create a status code extractor.
The urls vector is still available in your workspace. We have kept only the URLs that are reachable.
- Launch purrr and httr.
# Launch purrr and httr
library(purrr)
library(httr)
- Compose GET() and status_code().
# Compose a status extractor
status_extract <- compose(status_code, GET)
# Try with "https://thinkr.fr" & "https://en.wikipedia.org"
status_extract("https://thinkr.fr")
## [1] 200
status_extract("https://en.wikipedia.org")
## [1] 200
- Map it on the urls vector.
# Map it on the urls vector, return a vector of numbers
map_dbl(urls, status_extract)
## [1] 200 200 200 200
Nice! We have used purrr to quickly create a combination of two functions! And good news: all the websites we have tried to reach returned a 200 status code, meaning we were able to connect to all of them without any problem.
7.2 compose() and negate()
7.2.1 Build a function
You’re still trying to perfect your tools for web scraping, to be as efficient as possible in your job as a data analyst for a web agency.
In this exercise, you will make the extractor function from the previous exercise a little bit stricter: if the code returned by the status extractor is not between 200 and 203, the function will return a missing value (NA). Otherwise, the status code will be returned.
purrr and httr have been loaded for you.
- Negate the %in% operator, which is used to test if the element on the left is inside the element on the right.
# Negate the %in% function
`%not_in%` <- negate(`%in%`)
- Compose the extract_status() function, which will be a combination of GET() and status_code().
# Compose a status extractor
extract_status <- compose(status_code, GET)
- Complete the function definition: the url status code should be extracted and assigned to a code variable. Then if this code is not in 200:203, a missing value will be returned. Otherwise, the status code is returned.
# Complete the function definition
strict_code <- function(url) {
  # Extract the status of the URL
  code <- extract_status(url)
  # If code is not in the acceptable range ...
  if (code %not_in% 200:203) {
    # then return NA
    return(NA)
  }
  code
}
Good work! We now have a stricter version of our status code extractor. Let’s try it on a vector of URLs!
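To see the strict rule in action without contacting any website, the same logic can be applied to status codes that are already known. In this sketch, `strict_from_code()` and the `codes` vector are hypothetical stand-ins, not part of the exercise.

```r
library(purrr)

# Negated version of %in%, as in the exercise
`%not_in%` <- negate(`%in%`)

# Hypothetical helper: the same rule as strict_code(), applied to a
# status code that has already been retrieved
strict_from_code <- function(code) {
  if (code %not_in% 200:203) {
    # NA_real_ keeps the result numeric for map_dbl()
    return(NA_real_)
  }
  code
}

# Made-up status codes: two acceptable, two not
codes <- c(200, 203, 404, 500)
res_demo <- map_dbl(codes, strict_from_code)
res_demo
```

The first two codes come back unchanged, while 404 and 500 are replaced by NA.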
7.2.2 Count the NA
Now that you have a stricter version of the status code extractor, we’ll try it on our list of URLs.
What we want to do here is to see which of the websites from our list return a status code which is not between 200 and 203. To achieve this task, we’ll flip the is.na() function, that is to say that instead of returning TRUE if the value is missing, it will return FALSE.
The urls vector and the strict_code() function are available in your workspace. httr and purrr have been loaded for you.
- Map strict_code() against the vector of urls.
# Map the strict_code function on urls
res <- map_dbl(urls, strict_code)
- Set the names of the results with the set_names() function, using the urls vector.
# Set the names of the results
res_named <- set_names(res, urls)
- Create is_not_na() from the is.na() function by negating its behavior.
# Negate the is.na function
is_not_na <- negate(is.na)
- Run the is_not_na() function on the vector of results.
# Run is_not_na on the results
is_not_na(res_named)
## https://thinkr.fr https://colinfay.me
## TRUE TRUE
## https://en.wikipedia.org http://cran.r-project.org/
## TRUE TRUE
See how clear this code is? There are not that many lines of code, and the intent of each line is easy to see.
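Since every site in the run above passed, the negated predicate never returns FALSE there. The sketch below uses a made-up named vector, `named_demo`, in which one placeholder site fails, to show how the result can be used to pick out the sites that passed.

```r
library(purrr)

# Negated version of is.na(), as in the exercise
is_not_na <- negate(is.na)

# Made-up named results: the second site failed the strict check
named_demo <- c("https://example.com" = 200, "https://example.org" = NA)

# TRUE where a status code is present, FALSE where it is missing
passed <- is_not_na(named_demo)
passed

# Names of the sites that passed
ok_sites <- names(named_demo)[passed]
ok_sites
```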
7.3 Prefilling functions
7.3.1 A content extractor
In the previous exercises, you have established that all the elements from the URLs vector you were given return a 200 status code. Now that you know that they are accessible, you will dig deeper into the web scraping, by doing some content extraction.
To do this, we’ll use functions from the rvest package, which will be prefilled with partial(). The functions we will write in this exercise will extract all the H2 HTML nodes from a page; on a webpage, these H2 nodes correspond to the level 2 headers. Once we have extracted these titles, the html_text() function will be used to extract the text content from the raw HTML.
purrr and rvest have been loaded for you, and the urls vector is available in your workspace.
- Prefill html_nodes() with css = "h2".
# Prefill html_nodes() with the css param set to h2
get_h2 <- partial(html_nodes, css = "h2")
- Combine get_h2 with read_html and html_text, to create a text extractor for H2 headers.
# Combine the html_text, get_h2 and read_html functions
get_content <- compose(html_text, get_h2, read_html)
- Map get_content() to the urls vector, and name the result.
# Map get_content to the urls list
res <- map(urls, get_content) %>%
  set_names(urls)
# Print the results to the console
res
## $`https://thinkr.fr`
## [1] "\n"
## [2] "\n"
## [3] "Nos formations Certifiantes à R sont finançables à 100% via le CPF"
## [4] "R niveau 3 – Développeur – Conception d’interfaces Shiny – Formation certifiante mars 2022"
## [5] "Comment faire ses templates RMarkdown et Shiny ?"
## [6] "\nAfficher le numéro01 85 09 14 03\n"
## [7] "Des formateurs amouReux"
## [8] "Bénéficiez d'une formation sur-mesure pour vous et votre équipe"
## [9] "Les différents moyens de faire financer votre formation."
## [10] "“De la Création au Déploiement d’Applications {shiny} avec {golem}”"
##
## $`https://colinfay.me`
## character(0)
##
## $`https://en.wikipedia.org`
## [1] "From today's featured article" "Did you know ..."
## [3] "In the news" "On this day"
## [5] "From today's featured list" "Today's featured picture"
## [7] "Other areas of Wikipedia" "Wikipedia's sister projects"
## [9] "Wikipedia languages" "Navigation menu"
##
## $`http://cran.r-project.org/`
## character(0)
Well played! You now have a nice process to extract content from a webpage.
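Because live pages change over time, it can help to test the extractor on a fixed piece of HTML first. The sketch below assumes that read_html() accepts a literal HTML string; the `page` string is made up for illustration.

```r
library(purrr)
library(rvest)

# Made-up HTML document with two level 2 headers
page <- "<html><body><h2>First heading</h2><p>Some text</p><h2>Second heading</h2></body></html>"

# Prefill html_nodes() so that only the document is left to supply
get_h2 <- partial(html_nodes, css = "h2")

# Right to left: parse the HTML, select the H2 nodes, extract their text
get_content <- compose(html_text, get_h2, read_html)

headings <- get_content(page)
headings
```

Running the composed function on a known input makes it easier to tell whether an empty result, such as the character(0) above, comes from the page or from the code.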
7.3.2 Another extractor
In the previous exercise, we built a function that was able to extract the text content from H2 headers.
We’ll try something else here: we want to extract all the links that exist on a specific page. To do this, we will need to call two rvest functions: html_nodes(), with the css argument set to “a” (a is the HTML tag for links) and html_attr(), which extracts a given attribute from a node. In our case, this attribute will be “href”, which is the link address.
purrr and rvest have been loaded for you. You can still find the urls vector in your workspace.
- Create a partial version of html_nodes() with the css argument set to “a”.
# Create a partial version of html_nodes(), with the css param set to "a"
get_a <- partial(html_nodes, css = "a")
- Create the href() function, which will be a prefilled version of html_attr().
# Create href(), a partial version of html_attr()
href <- partial(html_attr, name = "href")
- Combine href(), get_a() and read_html().
# Combine href(), get_a(), and read_html()
get_links <- compose(href, get_a, read_html)
- Map get_links() to the urls vector.
# Map get_links() to the urls list
res <- map(urls, get_links) %>%
  set_names(urls)
# See the result
map(res,~head(.x))
## $`https://thinkr.fr`
## [1] "https://thinkr.fr"
## [2] "#"
## [3] "https://thinkr.fr/notre-vision-de-la-formation/"
## [4] "https://thinkr.fr/notre-vision-de-la-formation/"
## [5] "https://thinkr.fr/equipe/"
## [6] "https://thinkr.fr/blog/"
##
## $`https://colinfay.me`
## [1] "https://thinkr.fr/" "/" "/categories/"
## [4] "/about/" "/talks-publications/" "/open-source/"
##
## $`https://en.wikipedia.org`
## [1] NA "#mw-head" "#searchInput"
## [4] "/wiki/Wikipedia" "/wiki/Free_content" "/wiki/Encyclopedia"
##
## $`http://cran.r-project.org/`
## [1] "navbar.html"
Well played! See how easy it is to write a web mining function with just a few lines of code?
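The NA at the start of the Wikipedia result is worth a note: html_attr() returns NA for a node that does not carry the requested attribute. The sketch below reproduces this on a made-up HTML string, `link_page`, again assuming that read_html() accepts literal HTML.

```r
library(purrr)
library(rvest)

# Made-up HTML: the second <a> tag has no href attribute
link_page <- "<html><body><a href='/about/'>About</a><a name='top'>Top</a><a href='https://example.com'>External</a></body></html>"

# Same building blocks as in the exercise
get_a <- partial(html_nodes, css = "a")
href <- partial(html_attr, name = "href")
get_links <- compose(href, get_a, read_html)

links <- get_links(link_page)
links

# Drop the anchors that have no link address
links[!is.na(links)]
```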
7.4 List columns
7.4.1 About list-columns
You’ve been introduced in the video to a new kind of data structure: list-columns. List-columns are, as their name suggests, columns which behave like lists, but are inside a special kind of dataframe: a tibble, which is an implementation of the dataframe used in the tidyverse.
Nested dataframes (dataframes with list-columns) look like standard dataframes, but the cells of those columns are not necessarily of length 1, and can contain any kind of element, just like a list.
df <- tibble::tibble(
classic = c("a", "b","c"),
list = list(
c("a", "b","c"),
c("a", "b","c", "d"),
c("a", "b","c", "d", "e")
)
)
df
# A tibble: 3 x 2
classic list
<chr> <list>
1 a <chr [3]>
2 b <chr [4]>
3 c <chr [5]>
But why is this a useful format?
- To sound cool on Twitter.
- They print pretty in the console.
- To combine tools like dplyr and the flexibility of lists.
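The last option can be made concrete with a short sketch: once a list sits in a column, dplyr verbs and purrr mappers work together on it. The column is renamed `items` here only to avoid confusion with the list() function, and `df_demo` and `long_rows` are made-up names.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same shape as the example above, with a list-column named items
df_demo <- tibble(
  classic = c("a", "b", "c"),
  items = list(
    c("a", "b", "c"),
    c("a", "b", "c", "d"),
    c("a", "b", "c", "d", "e")
  )
)

long_rows <- df_demo %>%
  # map_int() computes one length per cell of the list-column
  mutate(n = map_int(items, length)) %>%
  # Regular dplyr filtering on the derived column
  filter(n > 3)

long_rows
```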
7.4.2 Create a list-column data.frame
Let’s end our chapter with an implementation of our links extractor, but using a list-column. The idea when using a nested dataframe (i.e., dataframe with a list column) is to keep everything inside a dataframe so that the workflow stays tidy.
You have been provided a tibble called df, which has a column urls with the four URLs you’ve been using since the beginning of this chapter. If you want to have a look at this dataframe, feel free to print it in the console.
We are going to create a new column called links, which contains the results of the get_links() function (available in your workspace). As the outputs of this function have different lengths, the output will be a list column that you will then need to unnest() to get back a standard dataframe.
- Load dplyr, tidyr, and purrr.
# Load dplyr, tidyr, and purrr
library(dplyr)
library(tidyr)
library(purrr)
- Take the df element, and run mutate() on it. mutate() will map the get_links() function on the urls column.
df <- data.frame(urls = urls)
# Create a column named links with mutate(), that maps get_links() on urls
df2 <- df %>%
  mutate(links = map(urls, get_links))
# Print df2 to see what it looks like
df2
## urls
## 1 https://thinkr.fr
## 2 https://colinfay.me
## 3 https://en.wikipedia.org
## 4 http://cran.r-project.org/
## links
## 1 https://thinkr.fr, #, https://thinkr.fr/notre-vision-de-la-formation/, https://thinkr.fr/notre-vision-de-la-formation/, https://thinkr.fr/equipe/, https://thinkr.fr/blog/, https://thinkr.fr/faq/, https://thinkr.fr/recrutement/, https://thinkr.fr/equipe/, https://thinkr.fr/diagnostic-et-accompagnement-a-lexploitation-de-la-donnee/, https://thinkr.fr/analyse-de-donnees/, https://thinkr.fr/visualisation-et-communication/, https://thinkr.fr/collecte-de-donnees/, https://thinkr.fr/bases-de-donnees/, https://thinkr.fr/formation-au-logiciel-r/, https://thinkr.fr/formation-au-logiciel-r/rs5073-analyse-statistique-de-donnees-avec-le-langage-r/, https://thinkr.fr/formation-au-logiciel-r/creation-de-packages-r/, https://thinkr.fr/formation-au-logiciel-r/formation-shiny/, https://thinkr.fr/formation-au-logiciel-r/formation-sur-mesure/, https://thinkr.fr/formation-au-logiciel-r/introduction-et-remise-a-niveau-langage-r/, https://thinkr.fr/formation-au-logiciel-r/developper-package-r/, https://thinkr.fr/formation-au-logiciel-r/suivi-de-version-avec-git-et-rstudio/, https://thinkr.fr/formation-au-logiciel-r/creation-de-graphiques-avec-ggplot2/, https://thinkr.fr/formation-au-logiciel-r/cartographie-et-sig-avec-r/, https://rtask.thinkr.fr/fr/, https://rtask.thinkr.fr/fr/usecase/, https://thinkr.fr/blog/, https://rtask.thinkr.fr, https://thinkr.fr/formation-r/, https://thinkr.fr/formation-au-logiciel-r/, https://thinkr.fr/formation-au-logiciel-r/analyser-des-donnees-avec-r/, #, https://thinkr.fr/formation-au-logiciel-r/, https://www.moncompteformation.gouv.fr/espace-prive/html/#/formation/recherche/81006451900020_certifR_1/81006451900020_certifR_1d_042020, https://thinkr.fr/formation-au-logiciel-r/rs5073-analyse-statistique-de-donnees-avec-le-langage-r/, https://thinkr.fr/formation-au-logiciel-r/conception-dinterfaces-shiny/, https://thinkr.fr/formation-au-logiciel-r/creation-de-packages-r/, https://thinkr.fr/contact/, https://thinkr.fr/formation-au-logiciel-r/, 
https://thinkr.fr/formation-au-logiciel-r/formation-shiny/, https://thinkr.fr/formation-au-logiciel-r/, https://thinkr.fr/comment-faire-ses-templates-rmarkdown-et-shiny/, https://thinkr.fr/ressources/, tel:0185091403, #, https://thinkr.fr/notre-vision-de-la-formation/, https://thinkr.fr/faq/, https://thinkr.fr/calendrier/formation/, https://thinkr.fr/equipe/, https://thinkr.fr/faq/, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, #, https://rtask.thinkr.fr, https://twitter.com/thinkR_fr, https://www.meetup.com/fr-FR/R-Lille/, https://thinkr.fr/calendrier/conference/, https://thinkr.fr, https://thinkr.fr/notre-vision-de-la-formation/, https://thinkr.fr/blog/, https://thinkr.fr/faq/, https://rtask.thinkr.fr/fr/contributions-open-source/, https://thinkr.fr/equipe/, https://thinkr.fr/recrutement/, https://thinkr.fr/formation-au-logiciel-r/, https://rtask.thinkr.fr, https://twitter.com/thinkR_fr, https://github.com/ThinkR-open, https://www.meetup.com/fr-FR/rparis/, https://thinkr.fr/contact/, tel:0185091403, https://thinkr.fr/mentions-legales/, https://thinkr.fr/qualiopi, https://www.avis-verifies.com/avis-clients/thinkr.fr, /cdn-cgi/l/email-protection
## 2 https://thinkr.fr/, /, /categories/, /about/, /talks-publications/, /open-source/, https://engineering-shiny.org/, /search/, /aoc-2021-02/, /aoc-2021-01/, /engineering-shiny-print/, /post-request-shiny-app-brochure/, /brochure-r-package/, /aoc-2020-09/, /aoc-2020-08/, /aoc-2020-07/, /aoc-2020-06/, /aoc-2020-05/, /aoc-2020-04/, /aoc-2020-03/, /aoc-2020-02/, /aoc-2020-01/, /we-run-rladies/, /run-rladies/, /r-package-npm/, /hexmake-shiny-contest/, /clients-db/, /hello-hordes/, #, #, /page2/, /page3/, #, /page7/, /page2/, https://twitter.com/_ColinFay, https://github.com/ColinFay, https://www.linkedin.com/in/colinfay, mailto:, /feed.xml, https://jekyllrb.com, https://mademistakes.com/work/minimal-mistakes-jekyll-theme/, https://www.r-bloggers.com/, http://www.rweekly.org, https://creativecommons.org/licenses/by-nc-sa/4.0//, https://opensource.org/licenses/mit-license.php
## 3 NA, #mw-head, #searchInput, /wiki/Wikipedia, /wiki/Free_content, /wiki/Encyclopedia, /wiki/Help:Introduction_to_Wikipedia, /wiki/Special:Statistics, /wiki/English_language, /wiki/Portal:The_arts, /wiki/Portal:Biography, /wiki/Portal:Geography, /wiki/Portal:History, /wiki/Portal:Mathematics, /wiki/Portal:Science, /wiki/Portal:Society, /wiki/Portal:Technology, /wiki/Wikipedia:Contents/Portals, /wiki/File:Richard_II_of_England.jpg, /wiki/Wonderful_Parliament, /wiki/Legislative_session, /wiki/Parliament_of_England, /wiki/Westminster_Abbey, /wiki/Richard_II_of_England, /wiki/Favourite, /wiki/Hundred_Years%27_War, /wiki/Lord_Chancellor, /wiki/Michael_de_la_Pole,_1st_Earl_of_Suffolk, /wiki/Impeachment, /wiki/Wonderful_Parliament, /wiki/Ur-Quan, /wiki/SS_Choctaw, /wiki/David_Berman_(musician), /wiki/Wikipedia:Today%27s_featured_article/February_2022, https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/, /wiki/Wikipedia:Featured_articles, /wiki/File:Cairo-citadel-1800s.jpg, /wiki/Baha_al-Din_Qaraqush, /wiki/Cairo_Citadel, /wiki/Saladin, /wiki/Canada_v_United_States_(2012_Summer_Olympics), /wiki/Canada_women%27s_national_soccer_team, /wiki/United_States_women%27s_national_soccer_team, /wiki/Henry_Fitzcount, /wiki/Fifth_Crusade, /wiki/Svalbard_Minute_by_Minute, /wiki/Economy_of_Svalbard#Tourism, /wiki/Dominion:_An_Anthology_of_Speculative_Fiction_From_Africa_and_the_African_Diaspora, /wiki/The_1619_Project, /wiki/Balbuena_metro_station, /wiki/Mexico_City_Metro_PCCI_fire, /wiki/Nathan_Safir, /wiki/KXTN_(AM), /wiki/Squatting_in_Hamburg, /wiki/Erotic_Art_Museum_(Hamburg), /wiki/Wikipedia:Recent_additions, /wiki/Help:Your_first_article, /wiki/Template_talk:Did_you_know, /wiki/File:Cooper_Kupp.jpg, /wiki/American_football, /wiki/Los_Angeles_Rams, /wiki/Cincinnati_Bengals, /wiki/Super_Bowl_LVI, /wiki/Super_Bowl_Most_Valuable_Player_Award, /wiki/Cooper_Kupp, /wiki/Cyclone_Batsirai, /wiki/Association_football, /wiki/2021_Africa_Cup_of_Nations, 
/wiki/Senegal_national_football_team, /wiki/Egypt_national_football_team, /wiki/2021_Africa_Cup_of_Nations_Final, /wiki/Playback_singer, /wiki/Lata_Mangeshkar, /wiki/Portal:Current_events, /wiki/COVID-19_pandemic, /wiki/2021%E2%80%932022_Russo-Ukrainian_crisis, /wiki/2022_Winter_Olympics, /wiki/Deaths_in_2022, /wiki/Ronald_Lou-Poy, /wiki/Gail_Halvorsen, /wiki/Luigi_De_Magistris_(cardinal), /wiki/Aled_Roberts, /wiki/Valerie_Boyd, /wiki/Raees_Mohammad, /wiki/Wikipedia:In_the_news/Candidates, /wiki/February_18, /wiki/File:Pajol.jpg, /wiki/Pierre_Claude_Pajol, /wiki/1766, /wiki/Malagasy_people, /wiki/Dutch_East_India_Company, /wiki/Meermin, /wiki/Meermin_slave_mutiny, /wiki/Cape_Agulhas, /wiki/1814, /wiki/War_of_the_Sixth_Coalition, /wiki/Napoleon, /wiki/Battle_of_Montereau, /wiki/1942, /wiki/World_War_II, /wiki/Imperial_Japanese_Army, /wiki/Sook_Ching, /wiki/Chinese_Singaporeans, /wiki/2007, /wiki/2007_Samjhauta_Express_bombings, /wiki/Samjhauta_Express, /wiki/Panipat, /wiki/Michelangelo, /wiki/George_Henschel, /wiki/Sergo_Ordzhonikidze, /wiki/February_17, /wiki/February_18, /wiki/February_19, /wiki/Wikipedia:Selected_anniversaries/February, https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/, /wiki/List_of_days_of_the_year, /wiki/File:Daresbury_church_tower.jpg, /wiki/All_Saints%27_Church,_Daresbury, /wiki/Listed_buildings_in_Runcorn_(rural_area), /wiki/Runcorn, /wiki/Borough_of_Halton, /wiki/Cheshire, /wiki/Listed_building, /wiki/Rocksavage, /wiki/Telephone_booth, /wiki/Daresbury, /wiki/Court, /wiki/Listed_buildings_in_Runcorn_(rural_area), /wiki/Robert_Bathurst_filmography, /wiki/List_of_alumni_of_Jesus_College,_Oxford, /wiki/Andre_Norton_Award, /wiki/Wikipedia:Today%27s_featured_list/February_2022, /wiki/Wikipedia:Featured_lists, /wiki/File:Giraffa_camelopardalis_head_(Profil).jpg, /wiki/Northern_giraffe, /wiki/North_Africa, /wiki/Ossicone, /wiki/Zoo_d%27Amn%C3%A9ville, https://commons.wikimedia.org/wiki/User:Ritchyblack, 
/wiki/Template:POTD/2022-02-17, /wiki/Template:POTD/2022-02-16, /wiki/Template:POTD/2022-02-15, /wiki/Wikipedia:Picture_of_the_day/Archive, /wiki/Wikipedia:Featured_pictures, /wiki/Wikipedia:Community_portal, /wiki/Wikipedia:Help_desk, /wiki/Wikipedia:Reference_desk, /wiki/Wikipedia:News, /wiki/Wikipedia:Teahouse, /wiki/Wikipedia:Village_pump, /wiki/Wikimedia_Foundation, https://wikimediafoundation.org/our-work/wikimedia-projects/, https://commons.wikimedia.org/wiki/, https://commons.wikimedia.org/wiki/, https://www.mediawiki.org/wiki/, https://www.mediawiki.org/wiki/, https://meta.wikimedia.org/wiki/, https://meta.wikimedia.org/wiki/, https://en.wikibooks.org/wiki/, https://en.wikibooks.org/wiki/, https://www.wikidata.org/wiki/, https://www.wikidata.org/wiki/, https://en.wikinews.org/wiki/, https://en.wikinews.org/wiki/, https://en.wikiquote.org/wiki/, https://en.wikiquote.org/wiki/, https://en.wikisource.org/wiki/, https://en.wikisource.org/wiki/, https://species.wikimedia.org/wiki/, https://species.wikimedia.org/wiki/, https://en.wikiversity.org/wiki/, https://en.wikiversity.org/wiki/, https://en.wikivoyage.org/wiki/, https://en.wikivoyage.org/wiki/, https://en.wiktionary.org/wiki/, https://en.wiktionary.org/wiki/, /wiki/English_language, https://meta.wikimedia.org/wiki/List_of_Wikipedias, https://ar.wikipedia.org/wiki/, https://de.wikipedia.org/wiki/, https://es.wikipedia.org/wiki/, https://fr.wikipedia.org/wiki/, https://it.wikipedia.org/wiki/, https://nl.wikipedia.org/wiki/, https://ja.wikipedia.org/wiki/, https://pl.wikipedia.org/wiki/, https://pt.wikipedia.org/wiki/, https://ru.wikipedia.org/wiki/, https://sv.wikipedia.org/wiki/, https://uk.wikipedia.org/wiki/, https://vi.wikipedia.org/wiki/, https://zh.wikipedia.org/wiki/, https://id.wikipedia.org/wiki/, https://ms.wikipedia.org/wiki/, https://zh-min-nan.wikipedia.org/wiki/, https://bg.wikipedia.org/wiki/, https://ca.wikipedia.org/wiki/, https://cs.wikipedia.org/wiki/, https://da.wikipedia.org/wiki/, 
https://eo.wikipedia.org/wiki/, https://eu.wikipedia.org/wiki/, https://fa.wikipedia.org/wiki/, https://he.wikipedia.org/wiki/, https://ko.wikipedia.org/wiki/, https://hu.wikipedia.org/wiki/, https://no.wikipedia.org/wiki/, https://ro.wikipedia.org/wiki/, https://sr.wikipedia.org/wiki/, https://sh.wikipedia.org/wiki/, https://fi.wikipedia.org/wiki/, https://tr.wikipedia.org/wiki/, https://ast.wikipedia.org/wiki/, https://bn.wikipedia.org/wiki/, https://bs.wikipedia.org/wiki/, https://et.wikipedia.org/wiki/, https://el.wikipedia.org/wiki/, https://simple.wikipedia.org/wiki/, https://gl.wikipedia.org/wiki/, https://hr.wikipedia.org/wiki/, https://lv.wikipedia.org/wiki/, https://lt.wikipedia.org/wiki/, https://ml.wikipedia.org/wiki/, https://mk.wikipedia.org/wiki/, https://nn.wikipedia.org/wiki/, https://sq.wikipedia.org/wiki/, https://sk.wikipedia.org/wiki/, https://sl.wikipedia.org/wiki/, https://th.wikipedia.org/wiki/, https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=1069328725, /wiki/Special:MyTalk, /wiki/Special:MyContributions, /w/index.php?title=Special:CreateAccount&returnto=Main+Page, /w/index.php?title=Special:UserLogin&returnto=Main+Page, /wiki/Main_Page, /wiki/Talk:Main_Page, /wiki/Main_Page, /w/index.php?title=Main_Page&action=edit, /w/index.php?title=Main_Page&action=history, /wiki/Main_Page, /wiki/Main_Page, /wiki/Wikipedia:Contents, /wiki/Portal:Current_events, /wiki/Special:Random, /wiki/Wikipedia:About, //en.wikipedia.org/wiki/Wikipedia:Contact_us, https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en, /wiki/Help:Contents, /wiki/Help:Introduction, /wiki/Wikipedia:Community_portal, /wiki/Special:RecentChanges, /wiki/Wikipedia:File_Upload_Wizard, /wiki/Special:WhatLinksHere/Main_Page, /wiki/Special:RecentChangesLinked/Main_Page, /wiki/Wikipedia:File_Upload_Wizard, /wiki/Special:SpecialPages, /w/index.php?title=Main_Page&oldid=1069328725, 
/w/index.php?title=Main_Page&action=info, /w/index.php?title=Special:CiteThisPage&page=Main_Page&id=1069328725&wpFormIdentifier=titleform, https://www.wikidata.org/wiki/Special:EntityPage/Q5296, /w/index.php?title=Special:DownloadAsPdf&page=Main_Page&action=show-download-screen, /w/index.php?title=Main_Page&printable=yes, https://commons.wikimedia.org/wiki/Main_Page, https://www.mediawiki.org/wiki/MediaWiki, https://meta.wikimedia.org/wiki/Main_Page, https://wikisource.org/wiki/Main_Page, https://species.wikimedia.org/wiki/Main_Page, https://en.wikibooks.org/wiki/Main_Page, https://www.wikidata.org/wiki/Wikidata:Main_Page, https://wikimania.wikimedia.org/wiki/Wikimania, https://en.wikinews.org/wiki/Main_Page, https://en.wikiquote.org/wiki/Main_Page, https://en.wikisource.org/wiki/Main_Page, https://en.wikiversity.org/wiki/Wikiversity:Main_Page, https://en.wikivoyage.org/wiki/Main_Page, https://en.wiktionary.org/wiki/Wiktionary:Main_Page, https://ar.wikipedia.org/wiki/, https://bn.wikipedia.org/wiki/, https://bg.wikipedia.org/wiki/, https://bs.wikipedia.org/wiki/, https://ca.wikipedia.org/wiki/, https://cs.wikipedia.org/wiki/, https://da.wikipedia.org/wiki/, https://de.wikipedia.org/wiki/, https://et.wikipedia.org/wiki/, https://el.wikipedia.org/wiki/, https://es.wikipedia.org/wiki/, https://eo.wikipedia.org/wiki/, https://eu.wikipedia.org/wiki/, https://fa.wikipedia.org/wiki/, https://fr.wikipedia.org/wiki/, https://gl.wikipedia.org/wiki/, https://ko.wikipedia.org/wiki/, https://hr.wikipedia.org/wiki/, https://id.wikipedia.org/wiki/, https://it.wikipedia.org/wiki/, https://he.wikipedia.org/wiki/, https://ka.wikipedia.org/wiki/, https://lv.wikipedia.org/wiki/, https://lt.wikipedia.org/wiki/, https://hu.wikipedia.org/wiki/, https://mk.wikipedia.org/wiki/, https://ms.wikipedia.org/wiki/, https://nl.wikipedia.org/wiki/, https://ja.wikipedia.org/wiki/, https://no.wikipedia.org/wiki/, https://nn.wikipedia.org/wiki/, https://pl.wikipedia.org/wiki/, 
https://pt.wikipedia.org/wiki/, https://ro.wikipedia.org/wiki/, https://ru.wikipedia.org/wiki/, https://simple.wikipedia.org/wiki/, https://sk.wikipedia.org/wiki/, https://sl.wikipedia.org/wiki/, https://sr.wikipedia.org/wiki/, https://sh.wikipedia.org/wiki/, https://fi.wikipedia.org/wiki/, https://sv.wikipedia.org/wiki/, https://th.wikipedia.org/wiki/, https://tr.wikipedia.org/wiki/, https://uk.wikipedia.org/wiki/, https://vi.wikipedia.org/wiki/, https://zh.wikipedia.org/wiki/, //en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License, //creativecommons.org/licenses/by-sa/3.0/, //foundation.wikimedia.org/wiki/Terms_of_Use, //foundation.wikimedia.org/wiki/Privacy_policy, //www.wikimediafoundation.org/, https://foundation.wikimedia.org/wiki/Privacy_policy, /wiki/Wikipedia:About, /wiki/Wikipedia:General_disclaimer, //en.wikipedia.org/wiki/Wikipedia:Contact_us, //en.m.wikipedia.org/w/index.php?title=Main_Page&mobileaction=toggle_view_mobile, https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute, https://stats.wikimedia.org/#/en.wikipedia.org, https://foundation.wikimedia.org/wiki/Cookie_statement, https://wikimediafoundation.org/, https://www.mediawiki.org/
## 4 navbar.html
# unnest() df2 to have a tidy dataframe
df2 %>%
  unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(links)`
## # A tibble: 484 × 2
## urls links
## <chr> <chr>
## 1 https://thinkr.fr https://thinkr.fr
## 2 https://thinkr.fr #
## 3 https://thinkr.fr https://thinkr.fr/notre-vision-de-la-formation/
## 4 https://thinkr.fr https://thinkr.fr/notre-vision-de-la-formation/
## 5 https://thinkr.fr https://thinkr.fr/equipe/
## 6 https://thinkr.fr https://thinkr.fr/blog/
## 7 https://thinkr.fr https://thinkr.fr/faq/
## 8 https://thinkr.fr https://thinkr.fr/recrutement/
## 9 https://thinkr.fr https://thinkr.fr/equipe/
## 10 https://thinkr.fr https://thinkr.fr/diagnostic-et-accompagnement-a-lexploita…
## # … with 474 more rows
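The warning above asks for the column to unnest to be named explicitly. A minimal sketch of that form, on a made-up nested tibble (`nested_demo`, with placeholder URLs and links) so that it runs without a network connection:

```r
library(tibble)
library(tidyr)

# Made-up nested data: one row per URL, a list-column of links
nested_demo <- tibble(
  urls = c("https://example.com", "https://example.org"),
  links = list(c("/a", "/b", "/c"), "/home")
)

# Naming the column in cols avoids the warning
flat_demo <- unnest(nested_demo, cols = c(links))
flat_demo
```

Each link gets its own row, and the matching value of urls is repeated alongside it.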
Well, you’ve aced this chapter on programming with purrr! Just imagine how many more lines of code you would have needed to get the list of all links without the tools from purrr. Now that you’ve got a good grasp of the potential of purrr, we’ll end this course with a case-study using a real life dataset.