Functional Programming with purrr
DataCamp
Course Description
Lists can be difficult to both understand and manipulate, but they can pack a ton of information and are very powerful. In this course, you will learn to easily extract, summarize, and manipulate lists and how to export the data to your desired object, be it another list, a vector, or even something else! Throughout the course, you will work with the purrr package and a variety of datasets from the repurrrsive package, including data from Star Wars and Wes Anderson films and data collected about GitHub users and GitHub repos. Following this course, your list skills will be purrrfect!
1 Simplifying with purrr
Iteration is a powerful way to make the computer do the work for you. It can also be an area of coding where it is easy to make lots of typos and simple mistakes. The purrr package helps simplify iteration so you can focus on the next step, instead of finding typos.
1.1 The power of iteration
1.1.1 Introduction to iteration
Imagine that you need to read in hundreds of files with a similar structure and perform an action on them. You don’t want to write hundreds of repetitive lines of code to read in all the files or to perform the action. Instead, you want to iterate over them. Iteration is the process of applying the same operation to multiple inputs. Being able to iterate makes your code more efficient, and it is especially powerful when working with lists.
For this exercise, the names of 16 CSV files have been loaded into a list called files. In your own work, you could use the list.files() function to create this list. The readr library is also already loaded.
This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the tidyverse Cheat Sheet and keep it handy!
Write a for loop that takes each element of the files list and gives it as an input to readr::read_csv(), which is another way of saying the read_csv() function from the readr package.
# Initialize list
all_files <- list()
Read each file into the all_files list.
# Build the full paths to the CSV files (pattern is a regular expression)
files <- list.files("/Users/apple/Documents/Rstudio/DataCamp/FoundationsofFunctionalProgrammingwithpurrr/simulated_data_from_1990_to_2005", pattern = "\\.csv$", full.names = TRUE)
# For loop to read files into a list
for(i in seq_along(files)){
  all_files[[i]] <- read_csv(files[[i]])
}
head(all_files)
## [[1]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1990 5.25 197.
## 2 1990 8.17 192.
## 3 1990 6.49 192.
## 4 1990 5.82 195.
## 5 1990 5.54 201.
## 6 1990 6.65 196.
## 7 1990 10.4 208.
## 8 1990 1.66 183.
## 9 1990 2.78 174.
## 10 1990 8.34 198.
## # … with 190 more rows
##
## [[2]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1991 3.70 197.
## 2 1991 5.37 187.
## 3 1991 7.05 186.
## 4 1991 1.97 207.
## 5 1991 8.05 217.
## 6 1991 1.97 213.
## 7 1991 5.33 195.
## 8 1991 4.32 204.
## 9 1991 4.46 177.
## 10 1991 4.63 222.
## # … with 190 more rows
##
## [[3]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1992 8.64 178.
## 2 1992 3.70 207.
## 3 1992 4.79 206.
## 4 1992 9.22 194.
## 5 1992 6.49 202.
## 6 1992 4.58 197.
## 7 1992 5.06 174.
## 8 1992 2.20 216.
## 9 1992 4.72 177.
## 10 1992 10.0 188.
## # … with 190 more rows
##
## [[4]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1993 2.34 204.
## 2 1993 5.44 167.
## 3 1993 6.86 213.
## 4 1993 5.70 197.
## 5 1993 2.78 193.
## 6 1993 3.24 164.
## 7 1993 5.59 234.
## 8 1993 3.02 183.
## 9 1993 4.60 182.
## 10 1993 7.56 205.
## # … with 190 more rows
##
## [[5]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1994 3.40 197.
## 2 1994 4.29 214.
## 3 1994 6.91 175.
## 4 1994 3.11 181.
## 5 1994 5.50 185.
## 6 1994 3.59 211.
## 7 1994 2.97 189.
## 8 1994 7.40 171.
## 9 1994 9.66 198.
## 10 1994 8.19 221.
## # … with 190 more rows
##
## [[6]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1995 5.12 197.
## 2 1995 4.18 219.
## 3 1995 3.70 186.
## 4 1995 4.46 204.
## 5 1995 7.48 209.
## 6 1995 8.38 204.
## 7 1995 4.51 202.
## 8 1995 5.68 208.
## 9 1995 5.24 211.
## 10 1995 3.04 212.
## # … with 190 more rows
Output the size of the all_files list.
# Output size of list object
length(all_files)
## [1] 16
Good work! Now let’s see how to do it more easily with purrr.
1.1.2 Iteration with purrr
You’ve made a great for loop, but it uses a lot of code to do something as simple as input a series of files into a list. This is where purrr comes in. We can do the same thing as a for loop in one line of code with purrr::map(). The function map() iterates over a list, applying another function that can be specified with the .f argument.
map() takes two arguments:
- The first is the list that will be iterated over
- The second is a function that will act on each element of the list
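The contrast between a for loop and map() can be sketched on in-memory data, without any files. This is a minimal illustration, assuming a hypothetical list named inputs; it is not part of the course exercise.

```r
library(purrr)

# Hypothetical input: three small numeric vectors standing in for the files
inputs <- list(c(1, 2, 3), c(10, 20), c(5, 5, 5, 5))

# For-loop version: pre-allocate the output, then fill it element by element
totals_loop <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
  totals_loop[[i]] <- sum(inputs[[i]])
}

# map() version: the list comes first, the function second
totals_map <- map(inputs, sum)

# Both approaches build the same list
identical(totals_loop, totals_map)
```

The map() call carries the same information as the loop (what to iterate over, what to do with each element) but leaves the indexing and output bookkeeping to purrr.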
The readr library is already loaded.
Load the purrr library (note the 3 Rs).
# Load purrr library
library(purrr)
Iterate over the list with map() instead. Use the same list files and the same function readr::read_csv().
# Use map to iterate
all_files_purrr <- map(files, read_csv)
head(all_files_purrr)
## [[1]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1990 5.25 197.
## 2 1990 8.17 192.
## 3 1990 6.49 192.
## 4 1990 5.82 195.
## 5 1990 5.54 201.
## 6 1990 6.65 196.
## 7 1990 10.4 208.
## 8 1990 1.66 183.
## 9 1990 2.78 174.
## 10 1990 8.34 198.
## # … with 190 more rows
##
## [[2]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1991 3.70 197.
## 2 1991 5.37 187.
## 3 1991 7.05 186.
## 4 1991 1.97 207.
## 5 1991 8.05 217.
## 6 1991 1.97 213.
## 7 1991 5.33 195.
## 8 1991 4.32 204.
## 9 1991 4.46 177.
## 10 1991 4.63 222.
## # … with 190 more rows
##
## [[3]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1992 8.64 178.
## 2 1992 3.70 207.
## 3 1992 4.79 206.
## 4 1992 9.22 194.
## 5 1992 6.49 202.
## 6 1992 4.58 197.
## 7 1992 5.06 174.
## 8 1992 2.20 216.
## 9 1992 4.72 177.
## 10 1992 10.0 188.
## # … with 190 more rows
##
## [[4]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1993 2.34 204.
## 2 1993 5.44 167.
## 3 1993 6.86 213.
## 4 1993 5.70 197.
## 5 1993 2.78 193.
## 6 1993 3.24 164.
## 7 1993 5.59 234.
## 8 1993 3.02 183.
## 9 1993 4.60 182.
## 10 1993 7.56 205.
## # … with 190 more rows
##
## [[5]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1994 3.40 197.
## 2 1994 4.29 214.
## 3 1994 6.91 175.
## 4 1994 3.11 181.
## 5 1994 5.50 185.
## 6 1994 3.59 211.
## 7 1994 2.97 189.
## 8 1994 7.40 171.
## 9 1994 9.66 198.
## 10 1994 8.19 221.
## # … with 190 more rows
##
## [[6]]
## # A tibble: 200 × 3
## years a b
## <dbl> <dbl> <dbl>
## 1 1995 5.12 197.
## 2 1995 4.18 219.
## 3 1995 3.70 186.
## 4 1995 4.46 204.
## 5 1995 7.48 209.
## 6 1995 8.38 204.
## 7 1995 4.51 202.
## 8 1995 5.68 208.
## 9 1995 5.24 211.
## 10 1995 3.04 212.
## # … with 190 more rows
Output the size of all_files_purrr.
# Output size of list object
length(all_files_purrr)
## [1] 16
Nice! You can see from the output here that 16 different files have been read into all_files_purrr.
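The whole workflow (find the files, then read them with map()) can be tried end to end without the course data. The sketch below is an assumption-laden stand-in: it writes three tiny CSV files to a temporary directory with hypothetical names, and it uses base R's read.csv() in place of readr::read_csv() to keep dependencies small.

```r
library(purrr)

# Hypothetical setup: write three small CSV files into a temporary folder
tmp_dir <- file.path(tempdir(), "purrr_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (yr in 1990:1992) {
  write.csv(data.frame(years = yr, a = 1:3),
            file.path(tmp_dir, paste0("data_", yr, ".csv")),
            row.names = FALSE)
}

# Collect full paths to every CSV file in the folder
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one list (base read.csv swapped in for read_csv)
all_files <- map(files, read.csv)

length(all_files)
```

Setting full.names = TRUE returns paths that can be passed straight to the reader, so no separate paste() step is needed.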
1.1.3 More iteration with for loops
Iteration isn’t just for reading in files though; iteration can be used to perform other actions on objects. First, you will try iterating with a for loop.
You’re going to change each element of a list into a numeric data type and then put it back into the same element in the same list.
For this exercise, you will iterate using a for loop that takes list_of_df, which is a list of character vectors, but the characters are actually numbers! You need to change the character vectors to numeric so that you can perform mathematical operations on them; you can use the base R function as.numeric() to do that.
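Check the class type of the first element of list_of_df.
# Stand-in data: a list of ten integer vectors, each 1 through 4
list_of_df <- lapply(1:10, function(x){1:4})
# Check the class type of the first element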
class(list_of_df[[1]])
## [1] "integer"
Build a for loop that takes each element of list_of_df, changes it into numeric data with as.numeric(), and adds it back into the same element of list_of_df.
# Change each element from a character to a number
for(i in seq_along(list_of_df)){
  list_of_df[[i]] <- as.numeric(list_of_df[[i]])
}
Check the class type of the first element of list_of_df again.
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
Print out list_of_df.
# Print out the list
head(list_of_df)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] 1 2 3 4
##
## [[4]]
## [1] 1 2 3 4
##
## [[5]]
## [1] 1 2 3 4
##
## [[6]]
## [1] 1 2 3 4
Nice! You can see from the output that we have a list of numbers now!
1.1.4 More iteration with purrr
Now you will change each element of a list into a numeric data type and then put it back into the same element in the same list, but instead of using a for loop, you’ll use map().
You can use the purrr function map() to more easily loop over a list, and turn the characters into numbers. Instead of having to build a whole for loop, you can use one line of code.
Check the class type of the first element of list_of_df.
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
Use map() to iterate over list_of_df and change each element of the list into numeric data.
# Change each character element to a number
list_of_df <- map(list_of_df, as.numeric)
Check the class type of the first element of list_of_df again.
# Check the class type of the first element again
class(list_of_df[[1]])
## [1] "numeric"
Print out list_of_df.
# Print out the list
head(list_of_df)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] 1 2 3 4
##
## [[4]]
## [1] 1 2 3 4
##
## [[5]]
## [1] 1 2 3 4
##
## [[6]]
## [1] 1 2 3 4
Good work! Now you can fix class type issues in your lists!
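To see the class change the exercise describes, it helps to start from vectors that really are character. A small sketch, assuming a hypothetical list named chars:

```r
library(purrr)

# Hypothetical input: numbers stored as character strings
chars <- list(c("1", "2"), c("3.5", "4"))

# Convert every element to numeric in one call
nums <- map(chars, as.numeric)

class(chars[[1]])  # "character"
class(nums[[1]])   # "numeric"

# Arithmetic now works on each element
totals <- map_dbl(nums, sum)
totals
```

Calling sum() on the original character vectors would raise an error, which is why the conversion step has to come first.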
1.2 Subsetting lists
1.2.1 Subsetting lists
Often when working in R, you’ll use dataframes or vectors. Another kind of R object is a list. While lists can be complicated, lists are also incredibly powerful. Lists are like Hermione Granger’s bag of holding (from Harry Potter); they can hold a wide variety of things. The contents of a list don’t have to be the same data type, and as long as you know how it’s organized, you can grab out what you need by subsetting.
Both named and unnamed lists can be subset using double square brackets, like this: listname[[index]]
If a list is named, you can also use $ for subsetting. The syntax list$elementname pulls out the named element from the list. As with any other kind of object in R, you can use str() to determine the structure of the list.
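The two subsetting styles can be compared on a small list. The list bag and its contents are hypothetical and exist only for illustration; the single-bracket line is included for contrast, because it returns a list rather than the element itself.

```r
# A small hypothetical named list holding different data types
bag <- list(
  numbers = c(1.5, 2.5, 3.5),
  label   = "example",
  table   = data.frame(id = 1:2, value = c("a", "b"))
)

first_by_position <- bag[[1]]     # double brackets return the element itself
first_by_name     <- bag$numbers  # $ works because the list is named
still_a_list      <- bag[1]       # single brackets return a one-element list

# Inspect the overall structure
str(bag)
```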
Load the repurrrsive package.
# Load repurrrsive package, to get access to the wesanderson dataset
library(repurrrsive)
Load the wesanderson dataset.
# Load wesanderson dataset
data(wesanderson)
Examine the structure of the first element in wesanderson.
# Get structure of first element in wesanderson
str(wesanderson[[1]])
## chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
Examine the structure of the GrandBudapest element in wesanderson.
# Get structure of GrandBudapest element in wesanderson
str(wesanderson$GrandBudapest)
## chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
Good work! Now you can subset and determine the structure of each part of a named or unnamed list!
1.2.2 Subsetting list elements
You can also subset within list elements using bracket notation like this: ListName$ElementName[VectorNumber]. If a list element is a dataframe, you can pull out a column like this: ListName$ElementName$ColumnName or ListName[[1]][,1].
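Each of these notations can be checked on a toy list. The names inventory, colors, and scores below are hypothetical, chosen only to mirror the patterns in the text.

```r
# Hypothetical list with a character vector and a dataframe
inventory <- list(
  colors = c("#F1BB7B", "#FD6467", "#5B1A18"),
  scores = data.frame(name = c("x", "y"), points = c(10, 20))
)

# ListName$ElementName[VectorNumber]
third_color <- inventory$colors[3]

# ListName$ElementName$ColumnName
points_by_name <- inventory$scores$points

# The same column by position: second list element, second column
points_by_position <- inventory[[2]][, 2]
```

The name-based and position-based forms return the same vector; the named form is usually easier to read and survives reordering of the list.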
In this exercise, you’ll examine the wesanderson and sw_films datasets from the repurrrsive package. wesanderson contains color palettes for each of Wes Anderson’s movies. These colors are recorded as hexadecimal codes, that is, a # followed by six hexadecimal digits that indicate a particular color. Here, you will use two ways of pulling out a particular hexadecimal color.
sw_films contains information about the films in the Star Wars franchise, such as title, director, producer, etc. You’ll use subsetting to explore this dataset.
Subset the third color from the first element of wesanderson. Then subset the fourth color from GrandBudapest.
# Third element of the first wesanderson vector
wesanderson[[1]][3]
## [1] "#5B1A18"
# Fourth element of the GrandBudapest wesanderson vector
wesanderson$GrandBudapest[4]
## [1] "#D67236"
Subset the first element from sw_films. Then subset the title element from the first element.
# Subset the first element of the sw_films data
sw_films[[1]]
## $title
## [1] "A New Hope"
##
## $episode_id
## [1] 4
##
## $opening_crawl
## [1] "It is a period of civil war.\r\nRebel spaceships, striking\r\nfrom a hidden base, have won\r\ntheir first victory against\r\nthe evil Galactic Empire.\r\n\r\nDuring the battle, Rebel\r\nspies managed to steal secret\r\nplans to the Empire's\r\nultimate weapon, the DEATH\r\nSTAR, an armored space\r\nstation with enough power\r\nto destroy an entire planet.\r\n\r\nPursued by the Empire's\r\nsinister agents, Princess\r\nLeia races home aboard her\r\nstarship, custodian of the\r\nstolen plans that can save her\r\npeople and restore\r\nfreedom to the galaxy...."
##
## $director
## [1] "George Lucas"
##
## $producer
## [1] "Gary Kurtz, Rick McCallum"
##
## $release_date
## [1] "1977-05-25"
##
## $characters
## [1] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/2/"
## [3] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/4/"
## [5] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/6/"
## [7] "http://swapi.co/api/people/7/" "http://swapi.co/api/people/8/"
## [9] "http://swapi.co/api/people/9/" "http://swapi.co/api/people/10/"
## [11] "http://swapi.co/api/people/12/" "http://swapi.co/api/people/13/"
## [13] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/15/"
## [15] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
## [17] "http://swapi.co/api/people/19/" "http://swapi.co/api/people/81/"
##
## $planets
## [1] "http://swapi.co/api/planets/2/" "http://swapi.co/api/planets/3/"
## [3] "http://swapi.co/api/planets/1/"
##
## $starships
## [1] "http://swapi.co/api/starships/2/" "http://swapi.co/api/starships/3/"
## [3] "http://swapi.co/api/starships/5/" "http://swapi.co/api/starships/9/"
## [5] "http://swapi.co/api/starships/10/" "http://swapi.co/api/starships/11/"
## [7] "http://swapi.co/api/starships/12/" "http://swapi.co/api/starships/13/"
##
## $vehicles
## [1] "http://swapi.co/api/vehicles/4/" "http://swapi.co/api/vehicles/6/"
## [3] "http://swapi.co/api/vehicles/7/" "http://swapi.co/api/vehicles/8/"
##
## $species
## [1] "http://swapi.co/api/species/5/" "http://swapi.co/api/species/3/"
## [3] "http://swapi.co/api/species/2/" "http://swapi.co/api/species/1/"
## [5] "http://swapi.co/api/species/4/"
##
## $created
## [1] "2014-12-10T14:23:31.880000Z"
##
## $edited
## [1] "2015-04-11T09:46:52.774897Z"
##
## $url
## [1] "http://swapi.co/api/films/1/"
# Subset the title element from the first element of the sw_films data
sw_films[[1]]$title
## [1] "A New Hope"
Great work, now you should be very comfortable subsetting lists!
1.3 The many flavors of map()
1.3.1 map() argument alternatives
You can also use iteration to answer a question, like how long is each element in the wesanderson dataset. You can do this by feeding map() a function like length(). The map(list, function) syntax works just fine for this. However, as future exercises get more complex, you will need to learn a second way:
map(list, ~function(.x))
This second way gives the same result as map(list, function). Use the argument .x to denote where the list element goes inside the function. When you use .x this way, you need to put a ~ in front of the function in the second argument of map().
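The equivalence of the two forms, and the reason the second one becomes useful, can be sketched with a small stand-in list. The list palettes and its values are hypothetical.

```r
library(purrr)

# Hypothetical stand-in for a list of color palettes
palettes <- list(
  warm = c("#F1BB7B", "#FD6467"),
  cool = c("#3B9AB2", "#78B7C5", "#EBCC2A")
)

# Passing the function by name
by_name <- map(palettes, length)

# Formula form: ~ starts the function, .x marks where each element goes
by_formula <- map(palettes, ~ length(.x))

identical(by_name, by_formula)

# The formula form allows extra work around .x, which a bare name cannot do
doubled <- map(palettes, ~ length(.x) * 2)
```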
Use map() on wesanderson and determine the length of each element in the “old” way.
# Map over wesanderson to get the length of each element
map(wesanderson, length)
## $GrandBudapest
## [1] 4
##
## $Moonrise1
## [1] 4
##
## $Royal1
## [1] 4
##
## $Moonrise2
## [1] 4
##
## $Cavalcanti
## [1] 5
##
## $Royal2
## [1] 5
##
## $GrandBudapest2
## [1] 4
##
## $Moonrise3
## [1] 5
##
## $Chevalier
## [1] 4
##
## $Zissou
## [1] 5
##
## $FantasticFox
## [1] 5
##
## $Darjeeling
## [1] 5
##
## $Rushmore
## [1] 5
##
## $BottleRocket
## [1] 7
##
## $Darjeeling2
## [1] 5
Use map() on wesanderson and determine the length of each element again, but this time using map(list, ~function(.x)).
# Map over wesanderson, and determine the length of each element
map(wesanderson, ~length(.x))
## $GrandBudapest
## [1] 4
##
## $Moonrise1
## [1] 4
##
## $Royal1
## [1] 4
##
## $Moonrise2
## [1] 4
##
## $Cavalcanti
## [1] 5
##
## $Royal2
## [1] 5
##
## $GrandBudapest2
## [1] 4
##
## $Moonrise3
## [1] 5
##
## $Chevalier
## [1] 4
##
## $Zissou
## [1] 5
##
## $FantasticFox
## [1] 5
##
## $Darjeeling
## [1] 5
##
## $Rushmore
## [1] 5
##
## $BottleRocket
## [1] 7
##
## $Darjeeling2
## [1] 5
Great Job! This new way of writing map_*() functions will come in handy in future exercises, so make a mental note of the ~ and the .x argument.
1.3.2 map_*
The map() function will return its output as a list. However, there are several different map() functions; you can use map_*() functions to tell purrr the type of output you want. The * in map_*() represents different R data types. For instance, you might want the output to be a vector of numbers so that you can put it inside a dataframe. So, unless you want something to be returned as a list, you need to determine what you want the output to be before you write your map() function.
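The effect of the suffix on the output type can be seen side by side on a tiny input. The list words is a hypothetical example.

```r
library(purrr)

# Hypothetical named list of strings
words <- list(a = "cat", b = "horse", c = "ox")

as_list <- map(words, nchar)                # a list, one element per input
as_int  <- map_int(words, nchar)            # a named integer vector
as_chr  <- map_chr(words, toupper)          # a named character vector
as_lgl  <- map_lgl(words, ~ nchar(.x) > 2)  # a named logical vector

as_int
```

Each typed variant returns an atomic vector that keeps the names of the input, which is what makes the result easy to drop into a dataframe column.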
- Determine the length of each element of the wesanderson dataset using our original map() function. Examine the output.
# Map over wesanderson, to determine the length of each element
map(wesanderson, length)
## $GrandBudapest
## [1] 4
##
## $Moonrise1
## [1] 4
##
## $Royal1
## [1] 4
##
## $Moonrise2
## [1] 4
##
## $Cavalcanti
## [1] 5
##
## $Royal2
## [1] 5
##
## $GrandBudapest2
## [1] 4
##
## $Moonrise3
## [1] 5
##
## $Chevalier
## [1] 4
##
## $Zissou
## [1] 5
##
## $FantasticFox
## [1] 5
##
## $Darjeeling
## [1] 5
##
## $Rushmore
## [1] 5
##
## $BottleRocket
## [1] 7
##
## $Darjeeling2
## [1] 5
- Create a dataframe that has the number of colors from each movie, using map_dbl(). The dbl means a double, or a number that can have a decimal.
# Create a numcolors column and fill with length of each wesanderson element
data.frame(numcolors = map_dbl(wesanderson, ~length(.x)))
## numcolors
## GrandBudapest 4
## Moonrise1 4
## Royal1 4
## Moonrise2 4
## Cavalcanti 5
## Royal2 5
## GrandBudapest2 4
## Moonrise3 5
## Chevalier 4
## Zissou 5
## FantasticFox 5
## Darjeeling 5
## Rushmore 5
## BottleRocket 7
## Darjeeling2 5
Good work! Notice how much cleaner the output was using map_dbl()! It’s always worth thinking through which map_*() function will get you where you need to go before coding it out. In our next chapter, we’ll dive into more complex uses of purrr.
2 More complex iterations
purrr is much more than a for loop; it works well with pipes, we can use it to run models and simulate data, and make nested loops!
2.1 Working with unnamed lists
2.1.1 Names & pipe refresher
It is easy to determine if a list has names using names(). Understanding the named elements of a list can make working with the list elements easier because you can pull out the information you need by name, instead of searching for the correct numbered element.
purrr is a part of the tidyverse, a system of packages designed to be used together, and used with pipes. Let’s do a quick refresh on how pipes work. A pipe %>% takes the output from the function that comes before it, and feeds it into the function that comes after the pipe as its first argument.
function_before() %>%
function_after()
You don’t need to use pipes when you use purrr functions, but for the purposes of these lessons, you will be.
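A pipe is only a different way of writing nested function calls, which a short comparison makes concrete. The vector values is a hypothetical example, and the pipe is loaded here from magrittr, the package that defines %>%.

```r
library(magrittr)

# Hypothetical input
values <- c(4, 9, 16)

# Nested calls read inside-out
nested <- sum(sqrt(values))

# Piped calls read left to right: values feeds sqrt(), whose output feeds sum()
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```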
- Check to see if the sw_films list has named elements with pipes.
# Use pipes to check for names in sw_films
sw_films %>%
  names()
## NULL
Good work! Now that you know how to check to see if a list has names in a tidy way, you’re ready to dive in.
2.1.2 Setting names
If you have an unnamed list, you can, of course, name each element. This can be very useful for being able to call out certain elements in a list, regardless of their order, especially if you are working with a list that may grow or change over time, or if you use the same code on several different lists. For instance, if you have a list that contains a dataframe, a model, and a plot, being able to call out $plot is much easier than searching to figure out which numbered element holds the plot.
- Name each element of the sw_films list and assign the result to a new list, sw_films_named.
- Iterate over the title element.
# Set names so each element of the list is named for the film title
sw_films_named <- sw_films %>%
  set_names(map_chr(sw_films, "title"))
# Check to see if the names worked/are correct
names(sw_films_named)
## [1] "A New Hope" "Attack of the Clones"
## [3] "The Phantom Menace" "Revenge of the Sith"
## [5] "Return of the Jedi" "The Empire Strikes Back"
## [7] "The Force Awakens"
Good work! Naming lists makes working in purrr easier and more human-readable.
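The payoff of naming shows up when elements are retrieved afterwards. Here is a minimal sketch with a hypothetical two-element list, films, standing in for sw_films:

```r
library(purrr)

# Hypothetical unnamed list of records
films <- list(
  list(title = "A", year = 1977),
  list(title = "B", year = 1980)
)

names(films)  # NULL: the list starts out unnamed

# Name each record after its own title element
films_named <- set_names(films, map_chr(films, "title"))

# Elements can now be pulled out by name instead of position
year_of_b <- films_named$B$year
```

Passing a string such as "title" to map_chr() extracts the element of that name from every record, which is a shortcut for writing ~ .x[["title"]].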
2.1.3 Pipes in map()
So you’ve refreshed your memory on how pipes can be used between functions. You can also use pipes inside the map() function to help you iterate a pipeline of tasks over a list of inputs.
Here, instead of using one of the repurrrsive datasets, you will be working with a list of numbers so that you can do a few mathematical operations.
# Create a list of values from 1 through 10
numlist <- list(1,2,3,4,5,6,7,8,9,10)
Write a map() function that takes the sqrt() of each element, and then the sin() of each element.
# Iterate over the numlist
map(numlist, ~.x %>% sqrt() %>% sin()) %>% head()
## [[1]]
## [1] 0.841471
##
## [[2]]
## [1] 0.9877659
##
## [[3]]
## [1] 0.9870266
##
## [[4]]
## [1] 0.9092974
##
## [[5]]
## [1] 0.7867491
##
## [[6]]
## [1] 0.6381576
Good work! Using pipes inside of map() makes iterating over multiple functions easy.
2.2 More map()
2.2.1 Simulating Data with Purrr
Often when trying to solve a problem with data we first need to build some simulated data to see if our idea is even possible. For example, you may want to test models with data that have known differences, to see if the models are working correctly.
In this exercise, you will see how this works in purrr by simulating data for two populations, a and b, from the sites: “north”, “east”, and “west”. The two populations will be randomly drawn from a normal distribution, with different means and standard deviations.
# List of sites north, east, and west
sites <- list("north","east","west")
Use map() to create a list of dataframes with three columns; the first column is sites.
- The second is population a, which has a mean of 5, a sample size n of 200, and an sd of (5/2).
- The third is population b, which has a mean of 200, a sample size n of 200, and an sd of 15.
# Create a list of dataframes, each with a sites, a, and b column
list_of_df <- map(sites,
  ~data.frame(sites = .x,
    a = rnorm(mean = 5, n = 200, sd = (5/2)),
    b = rnorm(mean = 200, n = 200, sd = 15)))
map(list_of_df,~head(.x))
## [[1]]
## sites a b
## 1 north 6.671339 197.9598
## 2 north 10.090051 212.0460
## 3 north 4.785466 185.1392
## 4 north 4.930145 233.7320
## 5 north 4.438924 205.4188
## 6 north -1.017154 212.5341
##
## [[2]]
## sites a b
## 1 east 7.099261 160.7708
## 2 east 0.112617 185.9312
## 3 east 3.535525 186.0787
## 4 east 2.920502 205.4623
## 5 east 6.971227 183.9190
## 6 east 6.388077 195.9712
##
## [[3]]
## sites a b
## 1 west 9.524335 186.3719
## 2 west 5.210117 199.9047
## 3 west 6.457159 201.0256
## 4 west 2.181551 215.0170
## 5 west 4.436093 195.9054
## 6 west 8.011461 189.4125
Good work! Now you can simulate data with ease.
2.2.2 Run a linear model
You can use map() to do more than just take the square root of a number or simulate data. You can also use map() to loop over different inputs to run several models, each using the unique values of a given list element. You can then iterate over the models you’ve run to create the model summaries and look at the results.
The lists sites and list_of_df are preloaded.
- Pipe list_of_df into map() along with the lm() linear model function, to compare a as the response and b as the predictor variable.
  - Use the syntax: lm(response ~ predictor, data = )
- Then pipe the linear model output into map() and generate the summary() of each model.
# Map over the models to look at the relationship of a vs b
list_of_df %>%
  map(~ lm(a ~ b, data = .)) %>%
  map(summary)
## [[1]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9401 -1.9836 -0.1301 1.6425 5.7177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.86981 2.27909 3.014 0.00291 **
## b -0.00916 0.01139 -0.804 0.42211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.445 on 198 degrees of freedom
## Multiple R-squared: 0.003258, Adjusted R-squared: -0.001776
## F-statistic: 0.6471 on 1 and 198 DF, p-value: 0.4221
##
##
## [[2]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.261 -1.462 0.050 1.651 6.573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73189 2.19967 -0.333 0.7397
## b 0.02786 0.01104 2.524 0.0124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.479 on 198 degrees of freedom
## Multiple R-squared: 0.03117, Adjusted R-squared: 0.02627
## F-statistic: 6.37 on 1 and 198 DF, p-value: 0.01239
##
##
## [[3]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8803 -1.7816 0.1036 1.8021 5.6971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.895e+00 2.294e+00 2.134 0.0341 *
## b -9.945e-05 1.146e-02 -0.009 0.9931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.407 on 198 degrees of freedom
## Multiple R-squared: 3.803e-07, Adjusted R-squared: -0.00505
## F-statistic: 7.53e-05 on 1 and 198 DF, p-value: 0.9931
Good work! This will make running multiple models and summarizing their results much easier.
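Full summaries are long to read when there are many models. One possible follow-up, sketched here under stated assumptions, is to pull a single number out of each model with map_dbl(). The data below is hypothetical simulated data with known slopes (2 and -1), and the seed is arbitrary.

```r
library(purrr)

set.seed(42)  # arbitrary seed so the simulated data is reproducible

# Hypothetical data: two groups with known slopes of 2 and -1
groups <- list(
  data.frame(x = 1:20, y =  2 * (1:20) + rnorm(20)),
  data.frame(x = 1:20, y = -1 * (1:20) + rnorm(20))
)

# Fit one linear model per dataframe
models <- map(groups, ~ lm(y ~ x, data = .x))

# Extract one value per model into a plain numeric vector
slopes    <- map_dbl(models, ~ coef(.x)[["x"]])
r_squared <- map_dbl(models, ~ summary(.x)$r.squared)

slopes
```

Because the true slopes are known, the extracted estimates double as a quick check that the models are being fitted to the right data.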
2.2.3 map_chr()
In this exercise, you’ll dive a bit deeper into the different map_*() variants. The map() function always outputs a list. map_*() outputs other kinds of information. Study the table below and make sure you’re clear on the type of output for each map_*() variant.
| map_*() | Output |
|---|---|
| map_chr() | character vector |
| map_lgl() | logical vector [TRUE or FALSE] |
| map_int() | integer vector |
| map_dbl() | double vector |
- Compare the results of map() and map_chr() for the director named element of sw_films.
# Pull out the director element of sw_films in a list and character vector
map(sw_films, ~.x[["director"]])
## [[1]]
## [1] "George Lucas"
##
## [[2]]
## [1] "George Lucas"
##
## [[3]]
## [1] "George Lucas"
##
## [[4]]
## [1] "George Lucas"
##
## [[5]]
## [1] "Richard Marquand"
##
## [[6]]
## [1] "Irvin Kershner"
##
## [[7]]
## [1] "J. J. Abrams"
map_chr(sw_films, ~.x[["director"]])
## [1] "George Lucas" "George Lucas" "George Lucas" "George Lucas"
## [5] "Richard Marquand" "Irvin Kershner" "J. J. Abrams"
- Compare the map() and map_lgl() outputs on sw_films for director == "George Lucas".
# Compare outputs when checking if director is George Lucas
map(sw_films, ~.x[["director"]] == "George Lucas")
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] FALSE
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] FALSE
map_lgl(sw_films, ~.x[["director"]] == "George Lucas")
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
Good work! Mastering the flavors of map_*() is key for success in purrr.
2.2.4 map_dbl() and map_int()
Some flavors of map_*() are very similar. map_dbl() and map_int() both output numbers. map_int() outputs integer vectors, which have numbers with no decimals. map_dbl() outputs double vectors, which have numbers that can have decimals. Take a closer look at how using different map_*() functions affects outputs.
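Printed output can hide the difference between the two numeric types, so typeof() is a more reliable check. A small sketch, assuming a hypothetical list vals of integers:

```r
library(purrr)

# Hypothetical input: three integers (the L suffix marks an integer literal)
vals <- list(1L, 2L, 3L)

# Integer arithmetic stays integer, so map_int() fits
ints <- map_int(vals, ~ .x * 2L)

# Division produces doubles, so map_dbl() fits
dbls <- map_dbl(vals, ~ .x / 2)

typeof(ints)  # "integer"
typeof(dbls)  # "double"
```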
Here is the map_*() table again as a reference.
| map_*() | Output |
|---|---|
| map_chr() | character vector |
| map_lgl() | logical vector [TRUE or FALSE] |
| map_int() | integer vector |
| map_dbl() | double vector |
Compare the map() and map_dbl() outputs for pulling out the episode_id for each element of sw_films.
# Pull out episode_id element as list
map(sw_films, ~.x[["episode_id"]])
## [[1]]
## [1] 4
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 5
##
## [[7]]
## [1] 7
# Pull out episode_id element as double vector
map_dbl(sw_films, ~.x[["episode_id"]])
## [1] 4 2 1 3 6 5 7
Compare the map() and map_int() outputs for pulling out the episode_id for each element of sw_films.
# Pull out episode_id element as a list
map(sw_films, ~.x[["episode_id"]])
## [[1]]
## [1] 4
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 5
##
## [[7]]
## [1] 7
# Pull out episode_id element as integer vector
map_int(sw_films, ~.x[["episode_id"]])
## [1] 4 2 1 3 6 5 7
Good work! Now you can output numbers without decimals!
2.3 map2() and pmap()
2.3.1 Simulating data with multiple inputs using map2()
The map() function is great if you need to iterate over one list; however, you will often need to iterate over two lists at the same time. This is where map2() comes in. While map() takes the list as the .x argument, map2() takes two lists as two arguments: .x and .y.
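The pairing of .x and .y is easiest to see with deterministic inputs. The lists bases and exponents are hypothetical, and map2_dbl() is the typed variant of map2() that returns a double vector instead of a list.

```r
library(purrr)

# Two hypothetical lists of equal length
bases     <- list(2, 3, 10)
exponents <- list(3, 2, 2)

# .x walks along the first list, .y along the second, in step
powers <- map2_dbl(bases, exponents, ~ .x ^ .y)

powers  # 8 9 100
```

The two lists are matched by position, so the first base is paired with the first exponent, the second with the second, and so on.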
To test out map2(), you are going to create a simple dataset, with one list of numbers and one list of strings. You will put these two lists together and create some simulated data.
Create a list, means, of the values 1 through 3.
# List of 1, 2 and 3
means <- list(1,2,3)
Create a sites list with “north”, “west”, and “east”.
# Create sites list
sites <- list("north","west","east")
Use map2() over the sites and means lists to create a dataframe with two columns.
- First column is sites; second column is generated by rnorm() with mean from the means list.
# Map over two arguments: sites and means
list_of_files_map2 <- map2(sites, means, ~data.frame(sites = .x,
                           a = rnorm(mean = .y, n = 200, sd = (5/2))))
map(list_of_files_map2,~head(.x))
## [[1]]
## sites a
## 1 north 3.449187
## 2 north 2.893941
## 3 north -2.361453
## 4 north 1.442438
## 5 north 1.414757
## 6 north 2.054845
##
## [[2]]
## sites a
## 1 west 2.1773931
## 2 west 1.8438938
## 3 west 4.9336391
## 4 west 3.2757952
## 5 west -0.2904645
## 6 west 2.6134759
##
## [[3]]
## sites a
## 1 east 2.297837
## 2 east 2.864035
## 3 east 3.616742
## 4 east 8.251796
## 5 east 3.199242
## 6 east 1.196774
Good work! Now you can use two lists together!
2.3.2 Simulating data 3+ inputs with pmap()
What if you need to iterate over three lists? Is there a map3()? To iterate over more than two lists, whether it’s three, four, or even 20, you’ll need to use pmap(). However, pmap() does require us to supply our list arguments a bit differently.
To use pmap(), you first need to create a master list of all the lists you want to iterate over. The master list is the input for pmap(). Instead of using .x or .y, use the list names as the argument names.
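The link between the names in the master list and the function's argument names can be sketched with a deterministic example. All names below (inputs, label, n, value) are hypothetical.

```r
library(purrr)

# Hypothetical master list: three parallel lists, each with a name
inputs <- list(
  label = list("low", "mid", "high"),
  n     = list(2, 3, 4),
  value = list(0, 5, 10)
)

# The function's argument names match the names in the master list
repeated <- pmap(inputs, function(label, n, value) {
  data.frame(label = label, value = rep(value, n))
})

# One dataframe per position; row counts follow n
row_counts <- map_int(repeated, nrow)
row_counts
```

Because arguments are matched by name, the order of the lists inside the master list does not need to match the order of the function's arguments.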
You are going to simulate data one more time, using five lists as inputs, instead of two. Using pmap() gives you complete control over your simulated dataset, and will allow you to use two different means and two different standard deviations along with the different sites.
sites
, means
, means2
, sigma
, and sigma2
lists.
=list(0.5,1,1.5)
means2=list(0.5,1,1.5)
sigma2=list(1,2,3)
sigma# Create a master list, a list of lists
<- list(sites = sites, means = means, sigma = sigma,
pmapinputs means2 = means2, sigma2 = sigma2)
pmap() over the list of lists, to create a list of dataframes with three columns; the first column is sites.
- The second column is a, which is rnorm() with mean = means, and sd = sigma.
- The third column is b, which is rnorm() with mean = means2, and sd = sigma2.
# Create a master list, a list of lists
pmapinputs <- list(sites = sites, means = means, sigma = sigma,
                   means2 = means2, sigma2 = sigma2)
# Map over the master list
list_of_files_pmap <- pmap(pmapinputs,
  function(sites, means, sigma, means2, sigma2){
    data.frame(sites = sites,
      a = rnorm(mean = means, n = 200, sd = sigma),
      b = rnorm(mean = means2, n = 200, sd = sigma2))})
map(list_of_files_pmap, ~head(.x))
## [[1]]
## sites a b
## 1 north 0.8789700 0.3855860
## 2 north -0.2245231 1.0029900
## 3 north 0.6417973 0.6355501
## 4 north 1.8780409 0.9760013
## 5 north 1.5165513 -0.1304455
## 6 north 2.4963962 1.1369883
##
## [[2]]
## sites a b
## 1 west -0.09834419 1.30693846
## 2 west -0.64468010 0.57628770
## 3 west 4.81134596 -0.01585508
## 4 west -0.85907440 -0.18470665
## 5 west 0.47639746 0.11106034
## 6 west 2.02665430 1.06197220
##
## [[3]]
## sites a b
## 1 east 5.537842 2.2437003
## 2 east 2.314830 1.1598322
## 3 east 1.287959 2.8198972
## 4 east 9.464502 1.1001475
## 5 east -1.857650 2.6695855
## 6 east 4.580386 0.1986446
Good work! With pmap() you now have all the power in purrr.
3 Troubleshooting lists with purrr
As with anything in R, knowing how to troubleshoot issues is an important skill. This can be particularly important with lists, where finding the problem can be tricky.
3.1 How to purrr safely()
3.1.1 safely() replace with NA
If you map() over a list, and one of the elements does not have the right data type, you will not get the output you expect. Perhaps you are trying to do a mathematical operation on each element, and it turns out one of the elements is a character; it simply won’t work.
If you have a very large list, figuring out where things went wrong, and what exactly went wrong, can be hard. That is where safely() comes in; it shows you both your results and where the errors occurred in your map() call.
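To make the shape of safely() output concrete, here is a small sketch on a hypothetical three-element list in which the second element is a character, so log() raises an actual error for it. The names values, res, and failed are illustrative only.

```r
library(purrr)

# Hypothetical input: the second element is a character, so log() errors on it
values <- list(1, "a", 10)

# safely() wraps log() so that every call returns list(result = , error = )
res <- values %>%
  map(safely(log)) %>%
  transpose()

res[["result"]][[2]]  # NULL: the call failed and the default `otherwise` is NULL
res[["error"]][[2]]   # the captured error condition for element 2

# Positions of the elements that raised an error
failed <- which(!map_lgl(res[["error"]], is.null))
```

After transpose(), the results and the errors sit in two parallel lists, so the position of a non-NULL error points directly at the offending input.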
- Use safely() with log(). log() cannot produce a valid result for -10 (it returns NaN with a warning), so pipe the output into transpose() to put the results first.
# Map safely over log
a <- list(-10, 1, 10, 0) %>%
  map(safely(log, otherwise = NA_real_)) %>%
  # Transpose the result
  transpose()
## Warning in .f(...): NaNs produced
- Print out a.
# Print the list
a
## $result
## $result[[1]]
## [1] NaN
##
## $result[[2]]
## [1] 0
##
## $result[[3]]
## [1] 2.302585
##
## $result[[4]]
## [1] -Inf
##
##
## $error
## $error[[1]]
## NULL
##
## $error[[2]]
## NULL
##
## $error[[3]]
## NULL
##
## $error[[4]]
## NULL
- Print out the “result” element of a.
# Print the result element in the list
a[["result"]]
## [[1]]
## [1] NaN
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 2.302585
##
## [[4]]
## [1] -Inf
- Print out just the error messages from a.
# Print the error element in the list
a[["error"]]
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
Good work! Now you have the power to start debugging your lists, and you can do it with simple element subsetting.
3.1.2 Convert data to numeric with purrr
In the sw_people dataset, some of the Star Wars characters have unknown heights. If you want to do some data exploration and determine how character height differs depending on their home planet, you need to write your code so that R understands the difference between heights and missing values. Currently, the missing values are entered as “unknown”, but you would like them as NA. In this exercise, you will combine map() and ifelse() to fix this issue.
Load the sw_people dataset.
# Load sw_people data
data(sw_people)
Map over sw_people and pull out “height”.
If the height is “unknown”, replace it with NA; otherwise, convert the value into a number with as.numeric().
# Map over sw_people and pull out the height element
height_cm <- map(sw_people, "height") %>%
  map(function(x){
    ifelse(x == "unknown", NA,
           as.numeric(x))
  })
Good work! Now you can use purrr for data wrangling to help clean numeric data in lists.
3.1.3 Finding the problem areas
When you are working with a small list, it might not seem like a lot of work to go through things manually and figure out what element has an issue. But if you have a list with hundreds or thousands of elements, you want to automate that process.
Now you’ll look at a situation with a larger list, where you can see how the error message can be useful to check through the entire list for issues.
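One way to automate that check is sketched below on a made-up list of 1,000 elements with two character values planted in it. The object names and the planted positions are assumptions for illustration, not part of the course data.

```r
library(purrr)

# Hypothetical list of 1,000 numbers with two character entries hidden in it
big_list <- as.list(seq_len(1000))
big_list[[250]] <- "unknown"
big_list[[731]] <- "n/a"

# Multiplying a character by a number raises an error, which safely() captures
checked <- big_list %>%
  map(safely(function(x) x * 0.0328084)) %>%
  transpose()

# Keep only the positions whose error slot is not NULL
bad_positions <- which(!map_lgl(checked[["error"]], is.null))

# Pull the message text out of each captured error
bad_messages <- map_chr(checked[["error"]][bad_positions], conditionMessage)
```

Here bad_positions holds the indices of the problem elements, so the error list never has to be read by eye.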
map() over sw_people and pull out the “height” element.
map() over safely() to convert the heights from centimeters into feet.
Set quiet = FALSE so that errors are printed.
# Map over sw_people and pull out the height element
height_ft <- map(sw_people, "height") %>%
  map(safely(function(x){
    x * 0.0328084
  }, quiet = FALSE)) %>%
  transpose()
Pipe into transpose(), to print the results first.
# Print your list, the result element, and the error element
#height_ft
#height_ft[["result"]]
#height_ft[["error"]]
Good work! Now you are ready to troubleshoot lists too large to check by hand.
3.2 Another way to possibly() purrr
3.2.1 Replace safely() with possibly()
Once you have figured out how to solve an issue with safely() (e.g., output an NA in place of an error), swap out safely() with possibly(). possibly() will run through your code and implement your desired changes without printing out the error messages.
You’ll now map() over log() again, but you will use possibly() instead of safely() since you already know how to resolve your errors.
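As a brief sketch of the difference, the example below uses a hypothetical list containing a character element. Where safely() would return a result/error pair for every element, possibly() returns only the value, substituting the otherwise value when the call errors.

```r
library(purrr)

# Hypothetical input: the second element makes log() raise an error
values <- list(1, "a", 10)

# possibly() returns `otherwise` whenever the wrapped function errors,
# so the output can go straight into a typed map_*() function
logs <- map_dbl(values, possibly(log, otherwise = NA_real_))

logs
```

Using NA_real_ rather than plain NA keeps every element a double, which is what map_dbl() expects.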
- Create a list with the values -10, 1, 10, and 0.
- map() over this list to take the log() of each element, using possibly().
- Use NA_real_ to fix any elements that are not the right data type.
# Take the log of each element in the list
a <- list(-10, 1, 10, 0) %>%
  map(possibly(function(x){
    log(x)
  }, NA_real_))
## Warning in log(x): NaNs produced
Good work! Now you can solve issues in lists using safely(), and then continue with your analysis using possibly().
3.2.2 Convert values with possibly()
Let’s say you need to convert the Star Wars character heights in sw_people from centimeters to feet. You already know that some of the heights have missing data, so you will use possibly() to convert missing values into NA. Then you will multiply each of the existing values by 0.0328084 to convert them from centimeters into feet.
To get a feel for your data, print out height_cm in the console to check out the heights in centimeters.
- Pipe the height_cm object into a map_*() function that returns double vectors.
- Convert each element in height_cm into feet (multiply it by 0.0328084).
- Since not all elements are numeric, use possibly() to replace instances that do not work with NA_real_.
# Create a piped workflow that returns double vectors
height_cm %>%
  map_dbl(possibly(function(x){
    # Convert centimeters to feet
    x * 0.0328084
  }, NA_real_))
## [1] 5.643045 5.479003 3.149606 6.627297 4.921260 5.839895 5.413386 3.182415
## [9] 6.003937 5.971129 6.167979 5.905512 7.480315 5.905512 5.675853 5.741470
## [17] 5.577428 5.905512 2.165354 5.577428 6.003937 6.561680 6.233596 5.807087
## [25] 5.741470 5.905512 4.921260 NA 2.887139 5.249344 6.332021 6.266404
## [33] 5.577428 6.430446 7.349082 6.758530 6.003937 4.494751 3.674541 6.003937
## [41] 5.347769 5.741470 5.905512 5.839895 3.083990 4.002625 5.347769 6.167979
## [49] 6.496063 6.430446 5.610236 6.036746 6.167979 8.661418 6.167979 6.430446
## [57] 6.069554 5.150919 6.003937 6.003937 5.577428 5.446194 5.413386 6.332021
## [65] 6.266404 6.003937 5.511811 6.496063 7.513124 6.988189 5.479003 2.591864
## [73] 3.149606 6.332021 6.266404 5.839895 7.086614 7.677166 6.167979 5.839895
## [81] 6.758530 NA NA NA NA NA 5.413386
Good work! Using possibly() helps us work with problem data in a really clean and efficient way.
3.3 purrr is a walk() in the park
3.3.1 Comparing walk() vs no walk() outputs
Printing out lists with map() shows a lot of bracketed text in the console, which can be useful for understanding their structure, but this information is usually not important for communicating with your end users. If you need to print, using walk() prints out lists in a more compact and human-readable way, without all those brackets. walk() is also great for printing out plots without printing anything to the console.
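The reason walk() prints less is that it calls the function only for its side effect and then returns its input invisibly, instead of building and printing a new list. A small sketch with a hypothetical list named scores:

```r
library(purrr)

# Hypothetical list used only for illustration
scores <- list(a = 1:3, b = letters[1:2])

# walk() calls print() on each element for its side effect,
# then invisibly returns the original input unchanged
returned <- walk(scores, print)

identical(returned, scores)
```

Because the input comes back unchanged, walk() can also sit in the middle of a pipeline without altering the data that flows through it.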
Here, you’ll be using the people_by_film dataset, which is derived from sw_films and has the URL of each character and the film they appear in.
Print people_by_film to the console.
# Print normally
people_by_film = read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRObsvb_OQ7qeXRvkTEbWBbQcYfyebglhoxAt9cIdRzH7Exf5s-mMqSgtjkHC0qNgK4PVsku7Q0bwfS/pub?gid=0&single=true&output=csv")
people_by_film %>% head()
## url film_url
## 1 http://swapi.co/api/people/1/ http://swapi.co/api/films/6/
## 2 http://swapi.co/api/people/1/ http://swapi.co/api/films/3/
## 3 http://swapi.co/api/people/1/ http://swapi.co/api/films/2/
## 4 http://swapi.co/api/people/1/ http://swapi.co/api/films/1/
## 5 http://swapi.co/api/people/1/ http://swapi.co/api/films/7/
## 6 http://swapi.co/api/people/2/ http://swapi.co/api/films/5/
Print out people_by_film using walk() and print().
# Print with walk
walk(people_by_film, print)
## [1] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/1/"
## [3] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/1/"
## [5] "http://swapi.co/api/people/1/" "http://swapi.co/api/people/2/"
## [7] "http://swapi.co/api/people/2/" "http://swapi.co/api/people/2/"
## [9] "http://swapi.co/api/people/2/" "http://swapi.co/api/people/2/"
## [11] "http://swapi.co/api/people/2/" "http://swapi.co/api/people/3/"
## [13] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/3/"
## [15] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/3/"
## [17] "http://swapi.co/api/people/3/" "http://swapi.co/api/people/3/"
## [19] "http://swapi.co/api/people/4/" "http://swapi.co/api/people/4/"
## [21] "http://swapi.co/api/people/4/" "http://swapi.co/api/people/4/"
## [23] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/5/"
## [25] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/5/"
## [27] "http://swapi.co/api/people/5/" "http://swapi.co/api/people/6/"
## [29] "http://swapi.co/api/people/6/" "http://swapi.co/api/people/6/"
## [31] "http://swapi.co/api/people/7/" "http://swapi.co/api/people/7/"
## [33] "http://swapi.co/api/people/7/" "http://swapi.co/api/people/8/"
## [35] "http://swapi.co/api/people/9/" "http://swapi.co/api/people/10/"
## [37] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/10/"
## [39] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/10/"
## [41] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/11/"
## [43] "http://swapi.co/api/people/11/" "http://swapi.co/api/people/11/"
## [45] "http://swapi.co/api/people/12/" "http://swapi.co/api/people/12/"
## [47] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/13/"
## [49] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/13/"
## [51] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/14/"
## [53] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/14/"
## [55] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/15/"
## [57] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/16/"
## [59] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
## [61] "http://swapi.co/api/people/18/" "http://swapi.co/api/people/18/"
## [63] "http://swapi.co/api/people/19/" "http://swapi.co/api/people/20/"
## [65] "http://swapi.co/api/people/20/" "http://swapi.co/api/people/20/"
## [67] "http://swapi.co/api/people/20/" "http://swapi.co/api/people/20/"
## [69] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/21/"
## [71] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/21/"
## [73] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/22/"
## [75] "http://swapi.co/api/people/22/" "http://swapi.co/api/people/22/"
## [77] "http://swapi.co/api/people/23/" "http://swapi.co/api/people/24/"
## [79] "http://swapi.co/api/people/25/" "http://swapi.co/api/people/25/"
## [81] "http://swapi.co/api/people/26/" "http://swapi.co/api/people/27/"
## [83] "http://swapi.co/api/people/27/" "http://swapi.co/api/people/28/"
## [85] "http://swapi.co/api/people/29/" "http://swapi.co/api/people/30/"
## [87] "http://swapi.co/api/people/31/" "http://swapi.co/api/people/32/"
## [89] "http://swapi.co/api/people/33/" "http://swapi.co/api/people/33/"
## [91] "http://swapi.co/api/people/33/" "http://swapi.co/api/people/34/"
## [93] "http://swapi.co/api/people/36/" "http://swapi.co/api/people/36/"
## [95] "http://swapi.co/api/people/37/" "http://swapi.co/api/people/38/"
## [97] "http://swapi.co/api/people/39/" "http://swapi.co/api/people/40/"
## [99] "http://swapi.co/api/people/40/" "http://swapi.co/api/people/41/"
## [101] "http://swapi.co/api/people/42/" "http://swapi.co/api/people/43/"
## [103] "http://swapi.co/api/people/43/" "http://swapi.co/api/people/44/"
## [105] "http://swapi.co/api/people/45/" "http://swapi.co/api/people/46/"
## [107] "http://swapi.co/api/people/46/" "http://swapi.co/api/people/46/"
## [109] "http://swapi.co/api/people/48/" "http://swapi.co/api/people/49/"
## [111] "http://swapi.co/api/people/50/" "http://swapi.co/api/people/51/"
## [113] "http://swapi.co/api/people/51/" "http://swapi.co/api/people/51/"
## [115] "http://swapi.co/api/people/52/" "http://swapi.co/api/people/52/"
## [117] "http://swapi.co/api/people/52/" "http://swapi.co/api/people/53/"
## [119] "http://swapi.co/api/people/53/" "http://swapi.co/api/people/53/"
## [121] "http://swapi.co/api/people/54/" "http://swapi.co/api/people/54/"
## [123] "http://swapi.co/api/people/55/" "http://swapi.co/api/people/55/"
## [125] "http://swapi.co/api/people/56/" "http://swapi.co/api/people/56/"
## [127] "http://swapi.co/api/people/57/" "http://swapi.co/api/people/58/"
## [129] "http://swapi.co/api/people/58/" "http://swapi.co/api/people/58/"
## [131] "http://swapi.co/api/people/59/" "http://swapi.co/api/people/59/"
## [133] "http://swapi.co/api/people/60/" "http://swapi.co/api/people/61/"
## [135] "http://swapi.co/api/people/62/" "http://swapi.co/api/people/63/"
## [137] "http://swapi.co/api/people/63/" "http://swapi.co/api/people/64/"
## [139] "http://swapi.co/api/people/64/" "http://swapi.co/api/people/65/"
## [141] "http://swapi.co/api/people/66/" "http://swapi.co/api/people/67/"
## [143] "http://swapi.co/api/people/67/" "http://swapi.co/api/people/68/"
## [145] "http://swapi.co/api/people/68/" "http://swapi.co/api/people/69/"
## [147] "http://swapi.co/api/people/70/" "http://swapi.co/api/people/71/"
## [149] "http://swapi.co/api/people/72/" "http://swapi.co/api/people/73/"
## [151] "http://swapi.co/api/people/74/" "http://swapi.co/api/people/47/"
## [153] "http://swapi.co/api/people/75/" "http://swapi.co/api/people/75/"
## [155] "http://swapi.co/api/people/76/" "http://swapi.co/api/people/77/"
## [157] "http://swapi.co/api/people/78/" "http://swapi.co/api/people/78/"
## [159] "http://swapi.co/api/people/79/" "http://swapi.co/api/people/80/"
## [161] "http://swapi.co/api/people/81/" "http://swapi.co/api/people/81/"
## [163] "http://swapi.co/api/people/82/" "http://swapi.co/api/people/82/"
## [165] "http://swapi.co/api/people/83/" "http://swapi.co/api/people/84/"
## [167] "http://swapi.co/api/people/85/" "http://swapi.co/api/people/86/"
## [169] "http://swapi.co/api/people/87/" "http://swapi.co/api/people/88/"
## [171] "http://swapi.co/api/people/35/" "http://swapi.co/api/people/35/"
## [173] "http://swapi.co/api/people/35/"
## [1] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [3] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [5] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/5/"
## [7] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [9] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [11] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [13] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [15] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [17] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/7/"
## [19] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [21] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [23] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [25] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [27] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/5/"
## [29] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/1/"
## [31] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [33] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
## [35] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [37] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [39] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [41] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [43] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [45] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/1/"
## [47] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [49] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [51] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/3/"
## [53] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [55] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/1/"
## [57] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/3/"
## [59] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/3/"
## [61] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/1/"
## [63] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/5/"
## [65] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [67] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [69] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [71] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/3/"
## [73] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/5/"
## [75] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [77] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [79] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/2/"
## [81] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/3/"
## [83] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/3/"
## [85] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [87] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/4/"
## [89] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [91] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/4/"
## [93] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [95] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [97] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [99] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [101] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [103] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [105] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/5/"
## [107] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [109] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
## [111] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [113] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [115] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [117] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [119] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [121] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [123] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [125] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [127] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/5/"
## [129] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/6/"
## [131] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [133] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [135] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [137] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [139] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [141] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [143] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [145] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/5/"
## [147] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [149] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [151] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [153] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [155] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
## [157] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [159] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [161] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/1/"
## [163] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/6/"
## [165] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/7/"
## [167] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [169] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [171] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/4/"
## [173] "http://swapi.co/api/films/6/"
Good work! Now you can use walk() to make your outputs cleaner and more human-readable.
3.3.2 walk() for printing cleaner list outputs
Now you will try one more use of walk(), specifically creating plots using walk(). In the previous exercise, you printed some lists, and you saw that printing lists is much cleaner using walk() than using the base R way. You can also use walk() to display multiple plots sequentially.
Here, use your map() knowledge along with ggplot2 functions to create a graph for the first ten elements of gap_split and then display each graph with walk().
Load the gap_split dataset.
# Load the gap_split data
data(gap_split)
map2() over the first 10 elements of gap_split, and the first 10 names of gap_split.
# Map over the first 10 elements of gap_split
plots <- map2(gap_split[1:10],
              names(gap_split[1:10]),
              ~ ggplot(.x, aes(year, lifeExp)) +
                geom_line() +
                labs(title = .y))
walk() over the new plots object and supply print() as an argument to print all plots.
# Object name, then function name
walk(plots, print)
Good work! Now you can print out multiple plots easily using walk().
4 Problem solving with purrr
Now that you have the building blocks, we will start tackling some more complex data problems with purrr.
4.1 Using purrr in your workflow
4.1.1 Name review
Now, you’ll quickly review how to check if a list has names, and how to pull out a specific element from a list. Remember, you can use the names() function to see if a list is named. There are several ways to extract a named element from a list, but the key difference from working with dataframes is to remember the [[double bracket]] syntax.
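The difference between the bracket forms can be seen on a small hypothetical record (the fields and values below are made up for illustration):

```r
# Hypothetical named list standing in for a single user record
user <- list(login = "jdoe", name = "Jane Doe", public_repos = 12L)

element_names <- names(user)      # the element names; NULL for an unnamed list
as_sublist    <- user["name"]     # single brackets: still a list, of length one
as_element    <- user[["name"]]   # double brackets: the element itself
via_dollar    <- user$name        # shorthand for [["name"]] with a literal name
```

Single brackets keep the list wrapper, so as_sublist cannot be used directly as a character string, while as_element can.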
- Load the gh_users data.
# Load the data
data(gh_users)
- Examine the names of gh_users.
# Check if data has names
names(gh_users)
## NULL
- Extract the names for each element of gh_users.
# Map over name element of list
map(gh_users, ~.x[["name"]])
## [[1]]
## [1] "Gábor Csárdi"
##
## [[2]]
## [1] "Jennifer (Jenny) Bryan"
##
## [[3]]
## [1] "Jeff L."
##
## [[4]]
## [1] "Julia Silge"
##
## [[5]]
## [1] "Thomas J. Leeper"
##
## [[6]]
## [1] "Maëlle Salmon"
Good work! Now that we have refreshed the basics of named lists, we can dive into our next task.
4.1.2 Setting names
Setting list names makes working with lists much easier in many scenarios; it makes the code easier to read, which is especially important when reviewing code weeks or months later.
Here you are going to work with the gh_repos and gh_users datasets and set their names in two different ways. The two methods will give the same result: a list with named elements.
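The two approaches can be sketched on a small hypothetical list of user records (the records below are invented for illustration and are not taken from gh_users):

```r
library(purrr)

# Hypothetical unnamed list of user records
users <- list(
  list(login = "jdoe", name = "Jane Doe"),
  list(login = "rroe", name = "Richard Roe")
)

# Way 1: compute the names inside the set_names() call
users_named_1 <- set_names(users, map_chr(users, "name"))

# Way 2: compute the names first, then pipe them in as the second argument;
# the `.` placeholder marks where the piped value goes
users_named_2 <- users %>%
  map_chr("login") %>%
  set_names(users, .)

names(users_named_1)
names(users_named_2)
```

Once the list is named, an element can be pulled out by name, for example users_named_1[["Jane Doe"]].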
Set the names of gh_users using the “name” element and use the map_*() function that outputs a character vector.
# Name gh_users with the names of the users
gh_users_named <- gh_users %>%
  set_names(map_chr(gh_users, "name"))
Check the structure of gh_repos to see where the owner info is stored.
# Check gh_repos structure
#str(gh_repos)
Create gh_repos_named, with names based on the login of the owner of the repo, using the set_names() and map_*() functions.
# Name gh_repos with the names of the repo owner
gh_repos_named <- gh_repos %>%
  map_chr(~ .[[1]]$owner$login) %>%
  set_names(gh_repos, .)
Good work! Sometimes list naming is tricky but purrr makes it simpler by easily extracting the element we want to use as the names.
4.1.3 Asking questions from a list
One of the great things about purrr is you can easily move from having a question about the data to an answer, with just a few lines of code. Here you are going to use the gh_users data to ask three questions:
- Which user joined GitHub first?
- Are all the repositories user-owned, rather than organization-owned?
- Which user has the most public repositories?
In this exercise, your map_*() knowledge is really tested, so make sure to reflect on all the different flavors of map_*() and how they should be used.
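As a quick reminder of how the flavors differ, here is a sketch on a hypothetical two-record list. Each map_*() variant returns an atomic vector of one specific type and errors if an element cannot be stored as that type.

```r
library(purrr)

# Hypothetical records used only to illustrate the output types
users <- list(
  list(name = "Jane Doe", type = "User", public_repos = 12L),
  list(name = "Acme Corp", type = "Organization", public_repos = 40L)
)

user_names <- map_chr(users, "name")                     # character vector
is_user    <- map_lgl(users, ~ .x[["type"]] == "User")   # logical vector
repo_count <- map_int(users, "public_repos")             # integer vector
```

Choosing the variant that matches the element type means the result can be passed directly to functions such as sort() or sum().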
Name gh_users with the “name” element and sort the “created_at” element to determine who joined GitHub first.
# Determine who joined github first
map_chr(gh_users, ~.x[["created_at"]]) %>%
set_names(map_chr(gh_users, "name")) %>%
sort()
## Jennifer (Jenny) Bryan Gábor Csárdi Jeff L.
## "2011-02-03T22:37:41Z" "2011-03-09T17:29:25Z" "2012-03-24T18:16:43Z"
## Thomas J. Leeper Maëlle Salmon Julia Silge
## "2013-02-07T21:07:00Z" "2014-08-05T08:10:04Z" "2015-05-19T02:51:23Z"
Output a vector that returns TRUE for each element where the “type” is “User”.
# Determine user versus organization
map_lgl(gh_users, ~.x[["type"]] == "User")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Output a named numeric vector of the number of “public_repos”.
# Determine who has the most public repositories
map_int(gh_users, ~.x[["public_repos"]]) %>%
set_names(map_chr(gh_users, "name")) %>%
sort()
## Julia Silge Maëlle Salmon Gábor Csárdi
## 26 31 52
## Jeff L. Thomas J. Leeper Jennifer (Jenny) Bryan
## 67 99 168
Good work! Now you can use functions you already know to ask any question of your data in just a few lines of code.
4.2 Even more complex problems
4.2.1 Questions about gh_repos
You’re going to use gh_repos again, a list where each element holds information about one user’s GitHub repositories. Here you will use map() and map_dbl() to answer the question:
- Which repository is the largest?
The GitHub API reports repository size in kilobytes. This information could be useful to document if you are working with a list-based dataset that changes over time, and need to be able to pull out information, like the largest repository, in the most recent dataset.
- map() over gh_repos.
- map_dbl() over the “size” element.
- Then map() to determine which repo is the largest.
# Map over gh_repos to generate numeric output
map(gh_repos,
~map_dbl(.x,
~.x[["size"]])) %>%
# Grab the largest element
map(~max(.x))
## [[1]]
## [1] 39461
##
## [[2]]
## [1] 96325
##
## [[3]]
## [1] 374812
##
## [[4]]
## [1] 24070
##
## [[5]]
## [1] 558176
##
## [[6]]
## [1] 76455
Good work! You’re gaining great skills to be able to answer questions in a reproducible way with your datasets.
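The output above gives the largest size for each owner but not which repository it belongs to. One possible way to recover the name is sketched below on a small hypothetical nested list with the same shape as gh_repos (the repository names and sizes are made up):

```r
library(purrr)

# Hypothetical stand-in for gh_repos: one inner list of repos per owner
repos <- list(
  list(list(name = "alpha", size = 120), list(name = "beta", size = 45)),
  list(list(name = "gamma", size = 300))
)

# For each owner, find the position of the largest repo and return its name
largest <- map_chr(repos, function(owner_repos) {
  sizes <- map_dbl(owner_repos, "size")
  owner_repos[[which.max(sizes)]][["name"]]
})

largest
```

which.max() returns the position of the first maximum, so ties are resolved in favor of the repository that appears first.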
4.3 Graphs in purrr
4.3.1 ggplot() refresher
You’ve already been introduced to the package ggplot2 in the prerequisite for this course, but let’s do a quick refresher.
- geom_point() makes scatterplots
- geom_histogram() makes histograms
In this exercise, you are going to use a dataframe created from the gh_users dataset, called gh_users_df, that has two columns: one for the number of public repositories a user has and another for how many followers that user has. Each row is a different user. Then you will make it into a scatter plot, a plot where the data are displayed with points.
Create a scatterplot with public_repos on the x axis and followers on the y axis.
gh_users_df = tribble(~public_repos, ~followers,
                      52, 303,
                      168, 780,
                      67, 3958,
                      26, 115,
                      99, 213,
                      31, 34)
# Scatter plot of public repos and followers
ggplot(data = gh_users_df,
aes(x = public_repos, y = followers))+
geom_point()
Create a histogram of followers by piping in gh_users_df.
# Histogram of followers
gh_users_df %>%
  ggplot(aes(x = followers)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Good work! Isn’t making plots fun? Now let’s dive into how purrr can help make more of them!
4.3.2 purrr and scatterplots
Since ggplot() does not accept lists as an input, it can be paired up with purrr to go from a list to a dataframe to a ggplot() graph in just a few lines of code.
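The list-to-dataframe step can be sketched on its own with a hypothetical list of records (the logins and follower counts are invented). The example assumes the dplyr package is installed, since map_df() relies on it to bind the rows together.

```r
library(purrr)

# Hypothetical list of user records; only two fields are wanted for plotting
users <- list(
  list(login = "jdoe", followers = 303, bio = "not needed"),
  list(login = "rroe", followers = 780, bio = "not needed")
)

# `[` keeps the requested elements of each record as a small named list,
# and map_df() binds those lists together as the rows of a dataframe
users_df <- map_df(users, `[`, c("login", "followers"))

users_df
```

The result has one row per list element and one column per extracted field, which is the shape ggplot() expects.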
You will continue to work with the gh_users data for this exercise. You will use a map_*() function to pull out a few of the named elements and transform them into the correct datatype. Then create a scatterplot that compares the user’s number of followers to the user’s number of public repositories.
- map() over gh_users, use the map_*() function that creates a dataframe, with four columns, named “login”, “name”, “followers” and “public_repos”.
- Pipe that dataframe into a scatterplot, where the x axis is followers and y is public_repos.
# Create a dataframe with four columns
map_df(gh_users, `[`,
c("login","name","followers","public_repos")) %>%
# Plot followers by public_repos
ggplot(.,
aes(x = followers, y = public_repos)) +
# Create scatter plots
geom_point()
Good work! Now you can go from list to plot using a tidy workflow!
4.3.3 purrr and histograms
Now you’re going to put together everything you’ve learned, starting with two different lists, which will be turned into a faceted histogram. You’re going to work again with the Star Wars data from the sw_films and sw_people datasets to answer a question:
- What is the distribution of heights of characters in each of the Star Wars films?
Different movies take place on different sets of planets, so you might expect to see different distributions of heights from the characters. Your first task is to transform the two datasets into dataframes since ggplot() requires a dataframe input. Then you will join them together, and plot the result, a histogram with a different facet, or subplot, for each film.
Create a dataframe with the “title” of each film, and the “characters” from each film in the sw_films dataset.
# Turn data into correct dataframe format
film_by_character <- tibble(filmtitle = map_chr(sw_films, "title")) %>%
  mutate(filmtitle, characters = map(sw_films, "characters")) %>%
  unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(characters)`
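The warning appears because recent versions of tidyr expect the list-column to be named explicitly. A minimal sketch of the same reshaping with cols supplied is shown below, using a hypothetical two-film list in place of sw_films (the titles and URLs are placeholders):

```r
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical stand-in for sw_films: each film has a title and
# a character vector of character URLs
films <- list(
  list(title = "Film A", characters = c("url/1", "url/2")),
  list(title = "Film B", characters = c("url/2"))
)

# Build one row per film with a list-column, then expand it to
# one row per film-character pair, naming the column to unnest
film_by_character <- tibble(
  filmtitle  = map_chr(films, "title"),
  characters = map(films, "characters")
) %>%
  unnest(cols = c(characters))

film_by_character
```

After unnesting, characters is an ordinary character column, so it can serve as a join key.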
Pull out the “height”, “mass”, “name”, and “url” elements from sw_people.
# Pull out elements from sw_people
sw_characters <- map_df(sw_people, `[`, c("height","mass","name","url"))
Join the two new objects by the “characters” and “url” keys.
# Join our two new objects
character_data <- inner_join(film_by_character, sw_characters, by = c("characters" = "url")) %>%
  # Make sure the columns are numbers
  mutate(height = as.numeric(height), mass = as.numeric(mass))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
Create a ggplot() histogram with x = height, faceted by filmtitle.
# Plot the heights, faceted by film title
ggplot(character_data, aes(x = height)) +
geom_histogram(stat = "count") +
facet_wrap(~ filmtitle)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 6 rows containing non-finite values (stat_count).
Good work! Now you’ve learned all the basics of how you can use purrr to make tasks that require iteration and working with lists more manageable and human-readable!