Joining parts and part categories
The inner_join
is the key to bring tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.
In this exercise, you’ll join a list of LEGO parts, available as parts
, with these parts’ corresponding categories, available as part_categories
. For example, the part Sticker Sheet 1 for Set 1650-1
is from the Stickers
part category. You can join these tables to see all parts’ categories!
parts <- readRDS("_data/parts.rds")
part_categories <- readRDS("_data/part_categories.rds")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Add the correct verb, table, and joining column
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"))
## # A tibble: 17,501 x 4
## part_num name.x part_cat_id name.y
## <chr> <chr> <dbl> <chr>
## 1 0901 Baseplate 16 x 30 with Set 080 Yellow ~ 1 Baseplates
## 2 0902 Baseplate 16 x 24 with Set 080 Small W~ 1 Baseplates
## 3 0903 Baseplate 16 x 24 with Set 080 Red Hou~ 1 Baseplates
## 4 0904 Baseplate 16 x 24 with Set 080 Large W~ 1 Baseplates
## 5 1 Homemaker Bookcase 2 x 4 x 4 7 Containers
## 6 10016414 Sticker Sheet #1 for 41055-1 58 Stickers
## 7 10026stk01 Sticker for Set 10026 - (44942/4184185) 58 Stickers
## 8 10039 Pullback Motor 8 x 4 x 2/3 44 Mechanical
## 9 10048 Minifig Hair Tousled 65 Minifig Headw~
## 10 10049 Minifig Shield Broad with Spiked Botto~ 27 Minifig Acces~
## # ... with 17,491 more rows
# Use the suffix argument to replace .x and .y suffixes
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"),
suffix = c("_part", "_category"))
## # A tibble: 17,501 x 4
## part_num name_part part_cat_id name_category
## <chr> <chr> <dbl> <chr>
## 1 0901 Baseplate 16 x 30 with Set 080 Yello~ 1 Baseplates
## 2 0902 Baseplate 16 x 24 with Set 080 Small~ 1 Baseplates
## 3 0903 Baseplate 16 x 24 with Set 080 Red H~ 1 Baseplates
## 4 0904 Baseplate 16 x 24 with Set 080 Large~ 1 Baseplates
## 5 1 Homemaker Bookcase 2 x 4 x 4 7 Containers
## 6 10016414 Sticker Sheet #1 for 41055-1 58 Stickers
## 7 10026stk01 Sticker for Set 10026 - (44942/41841~ 58 Stickers
## 8 10039 Pullback Motor 8 x 4 x 2/3 44 Mechanical
## 9 10048 Minifig Hair Tousled 65 Minifig Headwear
## 10 10049 Minifig Shield Broad with Spiked Bot~ 27 Minifig Accesso~
## # ... with 17,491 more rows
Great job! You can now see the part category associated with each part, and you used the suffix
argument to make sure your table’s fields are clear.
Joining parts and inventories
The LEGO data has many tables that can be joined together. Often times, some of the things you care about may be a few tables away (we’ll get to that later in the course). For now, we know that parts
is a list of all LEGO parts, and a new table, inventory_parts
, has some additional information about those parts, such as the color_id
of each part you would find in a specific LEGO kit.
Let’s join these two tables together to observe how joining parts
with inventory_parts
increases the size of your table because of the one-to-many relationship that exists between these two tables.
inventory_parts <- readRDS("_data/inventory_parts.rds")
# Combine the parts and inventory_parts tables
parts %>%
inner_join(inventory_parts, by = "part_num")
## # A tibble: 258,958 x 6
## part_num name part_cat_id inventory_id color_id quantity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0901 Baseplate 16 x 30 with S~ 1 1973 2 1
## 2 0902 Baseplate 16 x 24 with S~ 1 1973 2 1
## 3 0903 Baseplate 16 x 24 with S~ 1 1973 2 1
## 4 0904 Baseplate 16 x 24 with S~ 1 1973 2 1
## 5 1 Homemaker Bookcase 2 x 4~ 7 508 15 1
## 6 1 Homemaker Bookcase 2 x 4~ 7 1158 15 2
## 7 1 Homemaker Bookcase 2 x 4~ 7 6590 15 2
## 8 1 Homemaker Bookcase 2 x 4~ 7 9679 15 2
## 9 1 Homemaker Bookcase 2 x 4~ 7 12256 1 2
## 10 1 Homemaker Bookcase 2 x 4~ 7 13356 15 1
## # ... with 258,948 more rows
Awesome! Recall, this is an example of a one-to-many relationship. Notice that the table increased in the number of rows after the join.
Joining in either direction
An inner_join
works the same way with either table in either position. The table that is specified first is arbitrary, since you will end up with the same information in the resulting table either way.
Let’s prove this by joining the same two tables from the last exercise in the opposite order!
# Combine the parts and inventory_parts tables
inventory_parts %>%
inner_join(parts, by = "part_num")
## # A tibble: 258,958 x 6
## inventory_id part_num color_id quantity name part_cat_id
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 21 3009 7 50 Brick 1 x 6 11
## 2 25 21019c00pa~ 15 1 Legs and Hips with Bl~ 61
## 3 25 24629pr0002 78 1 Minifig Head Special ~ 59
## 4 25 24634pr0001 5 1 Headwear Accessory Bo~ 27
## 5 25 24782pr0001 5 1 Minifig Hipwear Skirt~ 27
## 6 25 88646 0 1 Tile Special 4 x 3 wi~ 15
## 7 25 973pr3314c~ 5 1 Torso with 1 White Bu~ 60
## 8 26 14226c11 0 3 String with End Studs~ 31
## 9 26 2340px2 15 1 Tail 4 x 1 x 3 with '~ 35
## 10 26 2340px3 15 1 Tail 4 x 1 x 3 with '~ 35
## # ... with 258,948 more rows
Great! This is the same join as the last exercise, but the order of the tables is reversed. For an inner_join
, either direction will yield a table that contains the same information! Note that the columns will appear in a different order depending on which table comes first.
Joining three tables
You can string together multiple joins with inner_join
and the pipe (%>%
), both with which you are already very familiar!
We’ll now connect sets
, a table that tells us about each LEGO kit, with inventories
, a table that tells us the specific version of a given set, and finally to inventory_parts
, a table which tells us how many of each part is available in each LEGO kit.
So if you were building a Batman LEGO set, sets
would tell you the name of the set, inventories
would give you IDs for each of the versions of the set, and inventory_parts
would tell you how many of each part would be in each version.
sets <- readRDS("_data/sets.rds")
inventories <- readRDS("_data/inventories.rds")
sets %>%
# Add inventories using an inner join
inner_join(inventories, by = "set_num") %>%
# Add inventory_parts using an inner join
inner_join(inventory_parts, by = c("id" = "inventory_id"))
## # A tibble: 258,958 x 9
## set_num name year theme_id id version part_num color_id quantity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 700.3-1 Medium Gift ~ 1949 365 24197 1 bdoor01 2 2
## 2 700.3-1 Medium Gift ~ 1949 365 24197 1 bdoor01 15 1
## 3 700.3-1 Medium Gift ~ 1949 365 24197 1 bdoor01 4 1
## 4 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02 15 6
## 5 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02 2 6
## 6 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02 4 6
## 7 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02 1 6
## 8 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02 14 6
## 9 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02a 15 6
## 10 700.3-1 Medium Gift ~ 1949 365 24197 1 bslot02a 2 6
## # ... with 258,948 more rows
Nice! You joined together multiple tables using inner_join()
and the pipe!
What’s the most common color?
Now let’s join an additional table, colors
, which will tell us the color of each part in each set, so that we can answer the question, “what is the most common color of a LEGO piece?”
colors <- readRDS("_data/colors.rds")
# Add an inner join for the colors table
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix=c("_set", "_color"))
## # A tibble: 258,958 x 11
## set_num name_set year theme_id id version part_num color_id quantity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 700.3-1 Medium ~ 1949 365 24197 1 bdoor01 2 2
## 2 700.3-1 Medium ~ 1949 365 24197 1 bdoor01 15 1
## 3 700.3-1 Medium ~ 1949 365 24197 1 bdoor01 4 1
## 4 700.3-1 Medium ~ 1949 365 24197 1 bslot02 15 6
## 5 700.3-1 Medium ~ 1949 365 24197 1 bslot02 2 6
## 6 700.3-1 Medium ~ 1949 365 24197 1 bslot02 4 6
## 7 700.3-1 Medium ~ 1949 365 24197 1 bslot02 1 6
## 8 700.3-1 Medium ~ 1949 365 24197 1 bslot02 14 6
## 9 700.3-1 Medium ~ 1949 365 24197 1 bslot02a 15 6
## 10 700.3-1 Medium ~ 1949 365 24197 1 bslot02a 2 6
## # ... with 258,948 more rows, and 2 more variables: name_color <chr>, rgb <chr>
# Count the number of colors and sort
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
count(name_color, sort = TRUE)
## # A tibble: 134 x 2
## name_color n
## <chr> <int>
## 1 Black 48068
## 2 White 30105
## 3 Light Bluish Gray 26024
## 4 Red 21602
## 5 Dark Bluish Gray 19948
## 6 Yellow 17088
## 7 Blue 12980
## 8 Light Gray 8632
## 9 Reddish Brown 6960
## 10 Tan 6664
## # ... with 124 more rows
Great! Notice that Black and White are the two most prominent colors.
##The left_join verb
Left joining two sets by part and color
In the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets. We’ve created these for you and they have been preloaded for you:
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
inventory_parts_joined <- sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix=c("_set", "_color")) %>%
select(c("set_num", "part_num", "color_id", "quantity")) %>% arrange(desc(quantity))
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
## # A tibble: 263 x 6
## set_num_falcon part_num color_id quantity_falcon set_num_star_de~
## <chr> <chr> <dbl> <dbl> <chr>
## 1 7965-1 63868 71 62 <NA>
## 2 7965-1 3023 0 60 <NA>
## 3 7965-1 3021 72 46 75190-1
## 4 7965-1 2780 0 37 75190-1
## 5 7965-1 60478 72 36 <NA>
## 6 7965-1 6636 71 34 75190-1
## 7 7965-1 3009 71 28 75190-1
## 8 7965-1 3665 71 22 <NA>
## 9 7965-1 2412b 72 20 75190-1
## 10 7965-1 3010 71 19 <NA>
## # ... with 253 more rows, and 1 more variable: quantity_star_destroyer <dbl>
Awesome! Using this table, and the part_num
or color_id
columns, you can determine the frequency that a specific part or piece appears in the Millennium Falcon and Star Destroyer sets!
Left joining two sets by color
In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone? As with the last exercise, the Millennium Falcon and Star Destroyer sets have been created and preloaded for you:
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
## `summarise()` ungrouping output (override with `.groups` argument)
# Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
## `summarise()` ungrouping output (override with `.groups` argument)
# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
left_join(star_destroyer_colors, by = "color_id", suffix= c("_falcon", "_star_destroyer"))
## # A tibble: 21 x 3
## color_id total_quantity_falcon total_quantity_star_destroyer
## <dbl> <dbl> <dbl>
## 1 0 201 336
## 2 1 15 23
## 3 4 17 53
## 4 14 3 4
## 5 15 15 17
## 6 19 95 12
## 7 28 3 16
## 8 33 5 NA
## 9 36 1 14
## 10 41 6 15
## # ... with 11 more rows
That’s right! Sometimes, the level on which you’re comparing two sets- by piece or just by color- will require some thought and some pre-processing.
Finding an observation that doesn’t have a match
Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.
For example, the inventories
table has a version
column, for when a LEGO kit gets some kind of change or upgrade. It would be fair to assume that all sets
(which joins well with inventories
) would have at least a version 1. But let’s test this assumption out in the following exercise.
inventory_version_1 <- inventories %>%
filter(version == 1)
# Join versions to sets
sets %>%
left_join(inventory_version_1, by = "set_num") %>%
# Filter for where version is na
filter(is.na(version))
## # A tibble: 1 x 6
## set_num name year theme_id id version
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 40198-1 Ludo game 2018 598 NA NA
Hey! It looks like there are cases where a set does not have an original version. In Chapter 3, you’ll learn another way to find observations like this: anti_join
.
Counting part colors
Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.
In the example below, we’ll count the part_cat_id
from parts, before using a right_join
to join with part_categories
. The reason we do this is because we don’t only want to know the count of part_cat_id
in parts
, but we also want to know if there are any part_cat_id
s not present in parts
.
parts %>%
# Count the part_cat_id
count(part_cat_id) %>%
# Right join part_categories
right_join(part_categories, by = c("part_cat_id" = "id"))
## # A tibble: 64 x 3
## part_cat_id n name
## <dbl> <int> <chr>
## 1 1 135 Baseplates
## 2 3 303 Bricks Sloped
## 3 4 1900 Duplo, Quatro and Primo
## 4 5 107 Bricks Special
## 5 6 128 Bricks Wedged
## 6 7 97 Containers
## 7 8 24 Technic Bricks
## 8 9 167 Plates Special
## 9 11 490 Bricks
## 10 12 85 Technic Connectors
## # ... with 54 more rows
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Filter for NA
filter(is.na(n))
## # A tibble: 1 x 3
## part_cat_id n name
## <dbl> <int> <chr>
## 1 66 NA Modulex
Great! From this filter, you found an instance where a part category is present in one table, but missing from the other table. It’s important to understand which entries would be impacted by replace_na(
), so that we know which entries we would be omitting by using that function.
Cleaning up your count
In both left and right joins, there is the opportunity for there to be NA values in the resulting table. Fortunately, the replace_na
function can turn those NAs into meaningful values.
In the last exercise, we saw that the n
column had NAs after the right_join
. Let’s use the replace_na
column, which takes a list
of column names and the values with which NAs should be replaced, to clean up our table.
library(tidyr)
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Use replace_na to replace missing values in the n column
replace_na(list(n = 0))
## # A tibble: 64 x 3
## part_cat_id n name
## <dbl> <dbl> <chr>
## 1 1 135 Baseplates
## 2 3 303 Bricks Sloped
## 3 4 1900 Duplo, Quatro and Primo
## 4 5 107 Bricks Special
## 5 6 128 Bricks Wedged
## 6 7 97 Containers
## 7 8 24 Technic Bricks
## 8 9 167 Plates Special
## 9 11 490 Bricks
## 10 12 85 Technic Connectors
## # ... with 54 more rows
Awesome! Now you have replaced NAs with 0s for your analysis.
Joining themes to their children
Tables can be joined to themselves!
In the themes
table, which is available for you to inspect in the console, you’ll notice there is both an id
column and a parent_id
column. Keeping that in mind, you can join the themes
table to itself to determine the parent-child relationships that exist for different themes.
In the videos, you saw themes joined to their own parents. In this exercise, you’ll try a similar approach of joining themes to their own children, which is similar but reversed. Let’s try this out to discover what children the theme "Harry Potter"
has.
themes <- readRDS("_data/themes.rds")
themes
## # A tibble: 665 x 3
## id name parent_id
## <dbl> <chr> <dbl>
## 1 1 Technic NA
## 2 2 Arctic Technic 1
## 3 3 Competition 1
## 4 4 Expert Builder 1
## 5 5 Model 1
## 6 6 Airport 5
## 7 7 Construction 5
## 8 8 Farm 5
## 9 9 Fire 5
## 10 10 Harbor 5
## # ... with 655 more rows
themes %>%
# Inner join the themes table
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# Filter for the "Harry Potter" parent name
filter(name_parent == "Harry Potter")
## # A tibble: 6 x 5
## id name_parent parent_id id_child name_child
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 246 Harry Potter NA 247 Chamber of Secrets
## 2 246 Harry Potter NA 248 Goblet of Fire
## 3 246 Harry Potter NA 249 Order of the Phoenix
## 4 246 Harry Potter NA 250 Prisoner of Azkaban
## 5 246 Harry Potter NA 251 Sorcerer's Stone
## 6 246 Harry Potter NA 667 Fantastic Beasts
Great! It looks like the Harry Potter parent theme has a few children, which are named after the books and movies from the series.
Joining themes to their grandchildren
We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.
Here, we can inner join themes
to a filtered version of itself again
to establish a connection between our last join’s children and their
children.
# Join themes to itself again to find the grandchild relationships
themes %>%
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
## # A tibble: 158 x 7
## id_parent name_parent parent_id id_child name_child id_grandchild name
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 Technic NA 5 Model 6 Airport
## 2 1 Technic NA 5 Model 7 Constructi~
## 3 1 Technic NA 5 Model 8 Farm
## 4 1 Technic NA 5 Model 9 Fire
## 5 1 Technic NA 5 Model 10 Harbor
## 6 1 Technic NA 5 Model 11 Off-Road
## 7 1 Technic NA 5 Model 12 Race
## 8 1 Technic NA 5 Model 13 Riding Cyc~
## 9 1 Technic NA 5 Model 14 Robot
## 10 1 Technic NA 5 Model 15 Traffic
## # ... with 148 more rows
Awesome! It looks like there are 158 grandparent-grandchild relationships of LEGO themes. Now you’re a whiz at using self joins to learn about hierarchical relationships.
Left-joining a table to itself
So far, you’ve been inner joining a table to itself in order to find the children of themes like "Harry Potter"
or "The Lord of the Rings"
.
But some themes might not have any children at all, which means they won’t be included in the inner join. As you’ve learned in this chapter, you can identify those with a left_join
and a filter()
.
themes %>%
# Left join the themes table to its own children
left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# Filter for themes that have no child themes
filter(is.na(name_child))
## # A tibble: 586 x 5
## id name_parent parent_id id_child name_child
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 2 Arctic Technic 1 NA <NA>
## 2 3 Competition 1 NA <NA>
## 3 4 Expert Builder 1 NA <NA>
## 4 6 Airport 5 NA <NA>
## 5 7 Construction 5 NA <NA>
## 6 8 Farm 5 NA <NA>
## 7 9 Fire 5 NA <NA>
## 8 10 Harbor 5 NA <NA>
## 9 11 Off-Road 5 NA <NA>
## 10 12 Race 5 NA <NA>
## # ... with 576 more rows
Wow! From this table, we learned that there are 586 themes in total that do not have any child themes.
Differences between Batman and Star Wars
In the video, you compared two sets. Now, you’ll compare two themes, each of which is made up of many sets.
First, you’ll need to join in the themes
. Recall that doing so requires going through the sets
first. You’ll use the inventory_parts_joined
table from the video, which is already available to you in the console.
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version)
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version)
# Start with inventory_parts_joined table
inventory_parts_joined %>%
# Combine with the sets table
inner_join(sets, by = "set_num") %>%
# Combine with the themes table
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
## # A tibble: 258,958 x 9
## set_num part_num color_id quantity name_set year theme_id name_theme
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 40179-1 3024 72 900 Persona~ 2016 277 Mosaic
## 2 40179-1 3024 15 900 Persona~ 2016 277 Mosaic
## 3 40179-1 3024 0 900 Persona~ 2016 277 Mosaic
## 4 40179-1 3024 71 900 Persona~ 2016 277 Mosaic
## 5 40179-1 3024 14 900 Persona~ 2016 277 Mosaic
## 6 k34434~ 3024 15 810 Lego Mo~ 2003 277 Mosaic
## 7 21010-1 3023 320 771 Robie H~ 2011 252 Architect~
## 8 k34431~ 3024 0 720 Lego Mo~ 2003 277 Mosaic
## 9 42083-1 2780 0 684 Bugatti~ 2018 5 Model
## 10 k34434~ 3024 0 540 Lego Mo~ 2003 277 Mosaic
## # ... with 258,948 more rows, and 1 more variable: parent_id <dbl>
Good work! Since each theme is made up of many sets, combining these tables is the first step towards being able to compare different themes.
Aggregating each theme
Previously, you combined tables to compare themes. Before doing this comparison, you’ll want to aggregate the data to learn more about the pieces that are a part of each theme, as well as the colors of those pieces.
The table you created previously has been preloaded for you as inventory_sets_themes
. It was filtered for each theme, and the objects have been saved as batman
and star_wars
.
inventory_sets_themes <- inventory_parts_joined %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
batman <- inventory_sets_themes %>%
filter(name_theme == "Batman")
star_wars <- inventory_sets_themes %>%
filter(name_theme == "Star Wars")
inventory_sets_themes <- inventory_parts_joined %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
inventory_sets_themes
## # A tibble: 258,958 x 9
## set_num part_num color_id quantity name_set year theme_id name_theme
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 40179-1 3024 72 900 Persona~ 2016 277 Mosaic
## 2 40179-1 3024 15 900 Persona~ 2016 277 Mosaic
## 3 40179-1 3024 0 900 Persona~ 2016 277 Mosaic
## 4 40179-1 3024 71 900 Persona~ 2016 277 Mosaic
## 5 40179-1 3024 14 900 Persona~ 2016 277 Mosaic
## 6 k34434~ 3024 15 810 Lego Mo~ 2003 277 Mosaic
## 7 21010-1 3023 320 771 Robie H~ 2011 252 Architect~
## 8 k34431~ 3024 0 720 Lego Mo~ 2003 277 Mosaic
## 9 42083-1 2780 0 684 Bugatti~ 2018 5 Model
## 10 k34434~ 3024 0 540 Lego Mo~ 2003 277 Mosaic
## # ... with 258,948 more rows, and 1 more variable: parent_id <dbl>
batman <- inventory_sets_themes %>%
filter(name_theme == "Batman")
batman
## # A tibble: 3,783 x 9
## set_num part_num color_id quantity name_set year theme_id name_theme
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 7787-1 3873 0 158 The Bat~ 2007 484 Batman
## 2 70904-1 6141 84 81 Clayfac~ 2017 484 Batman
## 3 70904-1 4032a 84 67 Clayfac~ 2017 484 Batman
## 4 77903-1 3023 46 46 The Dar~ 2019 484 Batman
## 5 7787-1 6558 0 44 The Bat~ 2007 484 Batman
## 6 70904-1 11477 84 41 Clayfac~ 2017 484 Batman
## 7 76054-1 6141 179 31 Batman:~ 2016 484 Batman
## 8 7787-1 2723 0 28 The Bat~ 2007 484 Batman
## 9 77903-1 3023 72 28 The Dar~ 2019 484 Batman
## 10 76035-1 98138 34 26 Jokerla~ 2015 484 Batman
## # ... with 3,773 more rows, and 1 more variable: parent_id <dbl>
star_wars <- inventory_sets_themes %>%
filter(name_theme == "Star Wars")
star_wars
## # A tibble: 5,402 x 9
## set_num part_num color_id quantity name_set year theme_id name_theme
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 7194-1 2357 19 84 Yoda 2002 158 Star Wars
## 2 7194-1 3001 19 73 Yoda 2002 158 Star Wars
## 3 7194-1 2420 378 72 Yoda 2002 158 Star Wars
## 4 75244-1 2780 0 64 Tantive~ 2019 158 Star Wars
## 5 7194-1 3002 19 57 Yoda 2002 158 Star Wars
## 6 7194-1 3004 19 55 Yoda 2002 158 Star Wars
## 7 7194-1 3010 19 54 Yoda 2002 158 Star Wars
## 8 8001-1 32062 0 54 Battle ~ 2000 158 Star Wars
## 9 75094-1 2780 0 52 Imperia~ 2015 158 Star Wars
## 10 75244-1 15068 15 49 Tantive~ 2019 158 Star Wars
## # ... with 5,392 more rows, and 1 more variable: parent_id <dbl>
# Count the part number and color id, weight by quantity
batman %>%
count(part_num, color_id, wt = quantity)
## # A tibble: 2,071 x 3
## part_num color_id n
## <chr> <dbl> <dbl>
## 1 10113 0 11
## 2 10113 272 1
## 3 10113 320 1
## 4 10183 57 1
## 5 10190 0 2
## 6 10201 0 1
## 7 10201 4 3
## 8 10201 14 1
## 9 10201 15 6
## 10 10201 71 4
## # ... with 2,061 more rows
star_wars %>%
count(part_num, color_id, wt = quantity)
## # A tibble: 2,413 x 3
## part_num color_id n
## <chr> <dbl> <dbl>
## 1 10169 4 1
## 2 10197 0 2
## 3 10197 72 3
## 4 10201 0 21
## 5 10201 71 5
## 6 10247 0 9
## 7 10247 71 16
## 8 10247 72 12
## 9 10884 28 1
## 10 10928 72 6
## # ... with 2,403 more rows
Great aggregating! Now that your information about the pieces in each of these themes is more organized, you are closer to being able to compare them to learn more about the similarities and differences between these themes!
Full-joining Batman and Star Wars LEGO parts
Now that you’ve got separate tables for the pieces in the batman
and star_wars
themes, you’ll want to be able to combine them to see any similarities or differences between the two themes. The aggregating from the last exercise has been saved as batman_parts
and star_wars_parts
, and is preloaded for you.
batman_parts <- batman %>%
count(part_num, color_id, wt = quantity)
star_wars_parts <- star_wars %>%
count(part_num, color_id, wt = quantity)
batman_parts <- batman %>%
count(part_num, color_id, wt = quantity)
star_wars_parts <- star_wars %>%
count(part_num, color_id, wt = quantity)
batman_parts %>%
# Combine the star_wars_parts table
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
# Replace NAs with 0s in the n_batman and n_star_wars columns
replace_na(list(n_batman = 0, n_star_wars = 0))
## # A tibble: 3,628 x 4
## part_num color_id n_batman n_star_wars
## <chr> <dbl> <dbl> <dbl>
## 1 10113 0 11 0
## 2 10113 272 1 0
## 3 10113 320 1 0
## 4 10183 57 1 0
## 5 10190 0 2 0
## 6 10201 0 1 21
## 7 10201 4 3 0
## 8 10201 14 1 0
## 9 10201 15 6 0
## 10 10201 71 4 5
## # ... with 3,618 more rows
Awesome! Now, you’ve created a comprehensive table that includes the part number, color id, and quantity of each piece in the Batman and Star Wars themes!
Comparing Batman and Star Wars LEGO parts
The table you created in the last exercise includes the part number of each piece, the color id, and the number of each piece in the Star Wars and Batman themes. However, we have more information about each of these parts that we can gain by combining this table with some of the information we have in other tables. Before we compare the themes, let’s ensure that we have enough information to make our findings more interpretable. The table from the last exercise has been saved as parts_joined
and is preloaded for you.
parts_joined <- batman_parts %>%
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
replace_na(list(n_batman = 0, n_star_wars = 0))
parts_joined <- batman_parts %>%
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
replace_na(list(n_batman = 0, n_star_wars = 0))
parts_joined %>%
# Sort the number of star wars pieces in descending order
arrange(desc(n_star_wars)) %>%
# Join the colors table to the parts_joined table
inner_join(colors, by = c("color_id" = "id")) %>%
# Join the parts table to the previous join
inner_join(parts, by = "part_num", suffix = c("_color", "_part"))
## # A tibble: 3,628 x 8
## part_num color_id n_batman n_star_wars name_color rgb name_part part_cat_id
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 2780 0 104 392 Black #051~ Technic ~ 53
## 2 32062 0 1 141 Black #051~ Technic ~ 46
## 3 4274 1 56 118 Blue #005~ Technic ~ 53
## 4 6141 36 11 117 Trans-Red #C91~ Plate Ro~ 21
## 5 3023 71 10 106 Light Blu~ #A0A~ Plate 1 ~ 14
## 6 6558 1 30 106 Blue #005~ Technic ~ 53
## 7 43093 1 44 99 Blue #005~ Technic ~ 53
## 8 3022 72 14 95 Dark Blui~ #6C6~ Plate 2 ~ 14
## 9 2357 19 0 84 Tan #E4C~ Brick 2 ~ 11
## 10 6141 179 90 81 Flat Silv~ #898~ Plate Ro~ 21
## # ... with 3,618 more rows
You can now see the pieces side by side from each of these themes. Since the pieces are sorted by number of Star Wars pieces in descending order, you can see that the most common Star Wars piece is Black and has the part number 2780
. While there are 392 pieces of this part in the Star Wars theme, you can also see from the table that there are 104 pieces of the same part in the Batman theme. Well done!
Mutating verbs
inner_join
left_join
right_join
full_join
Filtering joins
semi_join()
anti_join()
Something within one set but not another
In the videos, you learned how to filter using the semi- and anti-join verbs to answer questions you have about your data. Let’s focus on the batwing
dataset, and use our skills to determine which parts are in both the batwing
and batmobile
sets, and which sets are in one, but not the other. While answering these questions, we’ll also be determining whether or not the parts we’re looking at in both sets also have the same color in common.
The batmobile
and batwing
datasets have been preloaded for you.
batmobile <- inventory_parts_joined %>%
filter(set_num == "7784-1") %>%
select(-set_num)
batwing <- inventory_parts_joined %>%
filter(set_num == "70916-1") %>%
select(-set_num)
batmobile <- inventory_parts_joined %>%
filter(set_num == "7784-1") %>%
select(-set_num)
batwing <- inventory_parts_joined %>%
filter(set_num == "70916-1") %>%
select(-set_num)
# Filter the batwing set for parts that are also in the batmobile set
batwing %>%
semi_join(batmobile, by = c("part_num"))
## # A tibble: 126 x 3
## part_num color_id quantity
## <chr> <dbl> <dbl>
## 1 3023 0 22
## 2 3024 0 22
## 3 3623 0 20
## 4 2780 0 17
## 5 3666 0 16
## 6 3710 0 14
## 7 6141 4 12
## 8 2412b 71 10
## 9 6141 72 10
## 10 6558 1 9
## # ... with 116 more rows
# Filter the batwing set for parts that aren't in the batmobile set
batwing %>%
anti_join(batmobile, by = c("part_num"))
## # A tibble: 183 x 3
## part_num color_id quantity
## <chr> <dbl> <dbl>
## 1 11477 0 18
## 2 99207 71 18
## 3 22385 0 14
## 4 99563 0 13
## 5 10247 72 12
## 6 2877 72 12
## 7 61409 72 12
## 8 11153 0 10
## 9 98138 46 10
## 10 2419 72 9
## # ... with 173 more rows
Good work! Based on these joins, we now know that there are 126 parts in the batwing set that are also in the batmobile set, and 183 parts that are in the batwing set that aren’t in the batmobile set.
What colors are included in at least one set?
Besides comparing two sets directly, you could also use a filtering join like semi_join
to find out which colors ever appear in any inventory part. Some of the colors could be optional, meaning they aren’t included in any sets.
The inventory_parts
and colors
tables have been preloaded for you.
# Use inventory_parts to find colors included in at least one set
colors %>%
semi_join(inventory_parts, by = c("id" = "color_id"))
## # A tibble: 134 x 3
## id name rgb
## <dbl> <chr> <chr>
## 1 -1 [Unknown] #0033B2
## 2 0 Black #05131D
## 3 1 Blue #0055BF
## 4 2 Green #237841
## 5 3 Dark Turquoise #008F9B
## 6 4 Red #C91A09
## 7 5 Dark Pink #C870A0
## 8 6 Brown #583927
## 9 7 Light Gray #9BA19D
## 10 8 Dark Gray #6D6E5C
## # ... with 124 more rows
Wow! Out of the 179 colors in the colors
table, there are 134 colors that are included in at least one set.
Which set is missing version 1?
Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.
Let’s start by looking at the first version of each set to see if there are any sets that don’t include a first version.
# Use filter() to extract version 1
version_1_inventories <- inventories %>%
filter(version == 1)
# Use anti_join() to find which set is missing a version 1
sets %>%
anti_join(version_1_inventories, by = "set_num")
## # A tibble: 1 x 4
## set_num name year theme_id
## <chr> <chr> <dbl> <dbl>
## 1 40198-1 Ludo game 2018 598
Good work! This is likely a data quality issue, and anti_join is a great tool for finding problems like that.
Aggregating sets to look at their differences
To compare two individual sets, and the kinds of LEGO pieces that comprise them, we’ll need to aggregate the data into separate themes. Additionally, as we saw in the video, we’ll want to add a column so that we can understand the percentages of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.
The inventory_parts_themes
table has been preloaded for you.
inventory_parts_themes <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version) %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix
inventory_parts_themes <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version) %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
inventory_parts_themes
## # A tibble: 258,958 x 9
## set_num part_num color_id quantity name_set year theme_id name_theme
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 40179-1 3024 72 900 Persona~ 2016 277 Mosaic
## 2 40179-1 3024 15 900 Persona~ 2016 277 Mosaic
## 3 40179-1 3024 0 900 Persona~ 2016 277 Mosaic
## 4 40179-1 3024 71 900 Persona~ 2016 277 Mosaic
## 5 40179-1 3024 14 900 Persona~ 2016 277 Mosaic
## 6 k34434~ 3024 15 810 Lego Mo~ 2003 277 Mosaic
## 7 21010-1 3023 320 771 Robie H~ 2011 252 Architect~
## 8 k34431~ 3024 0 720 Lego Mo~ 2003 277 Mosaic
## 9 42083-1 2780 0 684 Bugatti~ 2018 5 Model
## 10 k34434~ 3024 0 540 Lego Mo~ 2003 277 Mosaic
## # ... with 258,948 more rows, and 1 more variable: parent_id <dbl>
batman_colors <- inventory_parts_themes %>%
# Filter the inventory_parts_themes table for the Batman theme
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
# Add a percent column of the total divided by the sum of the total
mutate(percent = total / sum(total))
## `summarise()` ungrouping output (override with `.groups` argument)
batman_colors
## # A tibble: 57 x 3
## color_id total percent
## <dbl> <dbl> <dbl>
## 1 0 2807 0.296
## 2 1 243 0.0256
## 3 2 158 0.0167
## 4 4 529 0.0558
## 5 5 1 0.000105
## 6 10 13 0.00137
## 7 14 426 0.0449
## 8 15 404 0.0426
## 9 19 142 0.0150
## 10 25 36 0.00380
## # ... with 47 more rows
# Filter and aggregate the Star Wars set data; add a percent column
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
## `summarise()` ungrouping output (override with `.groups` argument)
star_wars_colors
## # A tibble: 52 x 3
## color_id total percent
## <dbl> <dbl> <dbl>
## 1 0 3258 0.207
## 2 1 410 0.0261
## 3 2 36 0.00229
## 4 3 25 0.00159
## 5 4 434 0.0276
## 6 6 40 0.00254
## 7 7 209 0.0133
## 8 8 51 0.00324
## 9 10 6 0.000382
## 10 14 207 0.0132
## # ... with 42 more rows
Good work! In addition to being able to view the sets for Batman and Star Wars separately, adding the colum also allowed us to be able to look at the percentage differences between the sets, rather than only being able to compare the numbers of pieces. This will become more useful as we more forward in our analysis.
Combining sets
The data you aggregated in the last exercise has been preloaded for you as batman_colors
and star_wars_colors
. Prior to visualizing the data, you’ll want to combine these tables to be able to directly compare the themes’ colors.
batman_colors <- inventory_parts_themes %>%
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
batman_colors <- inventory_parts_themes %>%
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
## `summarise()` ungrouping output (override with `.groups` argument)
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
## `summarise()` ungrouping output (override with `.groups` argument)
batman_colors %>%
# Join the Batman and Star Wars colors
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
# Replace NAs in the total_batman and total_star_wars columns
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id"))
## # A tibble: 63 x 7
## color_id total_batman percent_batman total_star_wars percent_star_wa~ name
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 2807 0.296 3258 0.207 Black
## 2 1 243 0.0256 410 0.0261 Blue
## 3 2 158 0.0167 36 0.00229 Green
## 4 4 529 0.0558 434 0.0276 Red
## 5 5 1 0.000105 0 NA Dark~
## 6 10 13 0.00137 6 0.000382 Brig~
## 7 14 426 0.0449 207 0.0132 Yell~
## 8 15 404 0.0426 1771 0.113 White
## 9 19 142 0.0150 1012 0.0644 Tan
## 10 25 36 0.00380 36 0.00229 Oran~
## # ... with 53 more rows, and 1 more variable: rgb <chr>
batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
# Create the difference and total columns
mutate(difference = percent_batman - percent_star_wars,
total = total_batman + total_star_wars) %>%
# Filter for totals greater than 200
filter(total >= 200)
## # A tibble: 16 x 9
## color_id total_batman percent_batman total_star_wars percent_star_wa~ name
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 2807 0.296 3258 0.207 Black
## 2 1 243 0.0256 410 0.0261 Blue
## 3 4 529 0.0558 434 0.0276 Red
## 4 14 426 0.0449 207 0.0132 Yell~
## 5 15 404 0.0426 1771 0.113 White
## 6 19 142 0.0150 1012 0.0644 Tan
## 7 28 98 0.0103 183 0.0116 Dark~
## 8 36 86 0.00907 246 0.0156 Tran~
## 9 46 200 0.0211 39 0.00248 Tran~
## 10 70 297 0.0313 373 0.0237 Redd~
## 11 71 1148 0.121 3264 0.208 Ligh~
## 12 72 1453 0.153 2433 0.155 Dark~
## 13 84 278 0.0293 31 0.00197 Medi~
## 14 179 154 0.0162 232 0.0148 Flat~
## 15 378 22 0.00232 430 0.0273 Sand~
## 16 7 0 NA 209 0.0133 Ligh~
## # ... with 3 more variables: rgb <chr>, difference <dbl>, total <dbl>
Nice work! With this combined table that contains all the information from the batman_colors
and star_wars_colors
tables, you can now create an informative visualization to compare the colors of these sets.
Visualizing the difference: Batman and Star Wars
In the last exercise, you created colors_joined
. Now you’ll create a bar plot with one bar for each color (name
), showing the difference in percentages.
Because factors and visualization are beyond the scope of this course, we’ve done some processing for you: here is the code that created the colors_joined
table that will be used in the video.
colors_joined <- batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
mutate(difference = percent_batman - percent_star_wars,
total = total_batman + total_star_wars) %>%
filter(total >= 200) %>%
mutate(name = fct_reorder(name, difference))
library(forcats)
library(ggplot2)
colors_joined <- batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
mutate(difference = percent_batman - percent_star_wars,
total = total_batman + total_star_wars) %>%
filter(total >= 200) %>%
mutate(name = fct_reorder(name, difference)) %>%
drop_na()
colors_joined
## # A tibble: 15 x 9
## color_id total_batman percent_batman total_star_wars percent_star_wa~ name
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0 2807 0.296 3258 0.207 Black
## 2 1 243 0.0256 410 0.0261 Blue
## 3 4 529 0.0558 434 0.0276 Red
## 4 14 426 0.0449 207 0.0132 Yell~
## 5 15 404 0.0426 1771 0.113 White
## 6 19 142 0.0150 1012 0.0644 Tan
## 7 28 98 0.0103 183 0.0116 Dark~
## 8 36 86 0.00907 246 0.0156 Tran~
## 9 46 200 0.0211 39 0.00248 Tran~
## 10 70 297 0.0313 373 0.0237 Redd~
## 11 71 1148 0.121 3264 0.208 Ligh~
## 12 72 1453 0.153 2433 0.155 Dark~
## 13 84 278 0.0293 31 0.00197 Medi~
## 14 179 154 0.0162 232 0.0148 Flat~
## 15 378 22 0.00232 430 0.0273 Sand~
## # ... with 3 more variables: rgb <chr>, difference <dbl>, total <dbl>
# Create a bar plot using colors_joined and the name and difference columns
color_palette <- setNames(colors_joined$rgb, colors_joined$name)
ggplot(colors_joined, aes(name, difference, fill = name)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = color_palette, guide = FALSE) +
labs(y = "Difference: Batman - Star Wars")
Great! As you can see from the plot, the Batman set has more black, yellow, and red, while the Star Wars set has more light bluish gray, white, and tan.
Left-joining questions and tags
Three of the Stack Overflow survey datasets are questions
, question_tags
, and tags
:
questions
: an ID and the score, or how many times the question has been upvoted; the data only includes R-based questionsquestion_tags
: a tag ID for each question and the question’s idtags
: a tag id and the tag’s name, which can be used to identify the subject of each question, such as ggplot2 or dplyrIn this exercise, we’ll be stitching together these datasets and replacing NAs in important fields.
Note that we’ll be using left_join
s in this exercise to ensure we keep all questions, even those without a corresponding tag. However, since we know the questions
data is all R data, we’ll want to manually tag these as R questions with replace_na
.
questions <- readRDS("_data/questions.rds")
questions
## # A tibble: 294,735 x 3
## id creation_date score
## <int> <date> <int>
## 1 22557677 2014-03-21 1
## 2 22557707 2014-03-21 2
## 3 22558084 2014-03-21 2
## 4 22558395 2014-03-21 2
## 5 22558613 2014-03-21 0
## 6 22558677 2014-03-21 2
## 7 22558887 2014-03-21 8
## 8 22559180 2014-03-21 1
## 9 22559312 2014-03-21 0
## 10 22559322 2014-03-21 2
## # ... with 294,725 more rows
question_tags <- readRDS("_data/question_tags.rds")
question_tags
## # A tibble: 497,153 x 2
## question_id tag_id
## <int> <int>
## 1 22557677 18
## 2 22557677 139
## 3 22557677 16088
## 4 22557677 1672
## 5 22558084 6419
## 6 22558084 92764
## 7 22558395 5569
## 8 22558395 134
## 9 22558395 9412
## 10 22558395 18621
## # ... with 497,143 more rows
tags <- readRDS("_data/tags.rds")
tags
## # A tibble: 48,299 x 2
## id tag_name
## <dbl> <chr>
## 1 124399 laravel-dusk
## 2 124402 spring-cloud-vault-config
## 3 124404 spring-vault
## 4 124405 apache-bahir
## 5 124407 astc
## 6 124408 simulacrum
## 7 124410 angulartics2
## 8 124411 django-rest-viewsets
## 9 124414 react-native-lightbox
## 10 124417 java-module
## # ... with 48,289 more rows
# Join the questions and question_tags tables
questions %>%
left_join(question_tags, by = c("id" = "question_id"))
## # A tibble: 545,694 x 4
## id creation_date score tag_id
## <int> <date> <int> <int>
## 1 22557677 2014-03-21 1 18
## 2 22557677 2014-03-21 1 139
## 3 22557677 2014-03-21 1 16088
## 4 22557677 2014-03-21 1 1672
## 5 22557707 2014-03-21 2 NA
## 6 22558084 2014-03-21 2 6419
## 7 22558084 2014-03-21 2 92764
## 8 22558395 2014-03-21 2 5569
## 9 22558395 2014-03-21 2 134
## 10 22558395 2014-03-21 2 9412
## # ... with 545,684 more rows
# Join in the tags table
questions %>%
left_join(question_tags, by = c("id" = "question_id")) %>%
left_join(tags, by = c("tag_id" = "id"))
## # A tibble: 545,694 x 5
## id creation_date score tag_id tag_name
## <int> <date> <int> <dbl> <chr>
## 1 22557677 2014-03-21 1 18 regex
## 2 22557677 2014-03-21 1 139 string
## 3 22557677 2014-03-21 1 16088 time-complexity
## 4 22557677 2014-03-21 1 1672 backreference
## 5 22557707 2014-03-21 2 NA <NA>
## 6 22558084 2014-03-21 2 6419 time-series
## 7 22558084 2014-03-21 2 92764 panel-data
## 8 22558395 2014-03-21 2 5569 function
## 9 22558395 2014-03-21 2 134 sorting
## 10 22558395 2014-03-21 2 9412 vectorization
## # ... with 545,684 more rows
# Replace the NAs in the tag_name column
questions %>%
left_join(question_tags, by = c("id" = "question_id")) %>%
left_join(tags, by = c("tag_id" = "id")) %>%
replace_na(list(tag_name="only-r"))
## # A tibble: 545,694 x 5
## id creation_date score tag_id tag_name
## <int> <date> <int> <dbl> <chr>
## 1 22557677 2014-03-21 1 18 regex
## 2 22557677 2014-03-21 1 139 string
## 3 22557677 2014-03-21 1 16088 time-complexity
## 4 22557677 2014-03-21 1 1672 backreference
## 5 22557707 2014-03-21 2 NA only-r
## 6 22558084 2014-03-21 2 6419 time-series
## 7 22558084 2014-03-21 2 92764 panel-data
## 8 22558395 2014-03-21 2 5569 function
## 9 22558395 2014-03-21 2 134 sorting
## 10 22558395 2014-03-21 2 9412 vectorization
## # ... with 545,684 more rows
Awesome! We now have a dataset that we can analyze after all that joining.
Comparing scores across tags
The complete dataset you created in the last exercise is available to you as questions_with_tags
. Let’s do a quick bit of analysis on it! You’ll use familiar dplyr verbs like group_by
, summarize
, arrange
, and n
to find out the average score of the most asked questions.
questions_with_tags <- questions %>%
left_join(question_tags, by = c("id" = "question_id")) %>%
left_join(tags, by = c("tag_id" = "id")) %>%
replace_na(list(tag_name="only-r"))
questions_with_tags %>%
# Group by tag_name
group_by(tag_name) %>%
# Get mean score and num_questions
summarize(score = mean(score),
num_questions = n()) %>%
# Sort num_questions in descending order
arrange(desc(num_questions))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7,841 x 3
## tag_name score num_questions
## <chr> <dbl> <int>
## 1 only-r 1.26 48541
## 2 ggplot2 2.61 28228
## 3 dataframe 2.31 18874
## 4 shiny 1.45 14219
## 5 dplyr 1.95 14039
## 6 plot 2.24 11315
## 7 data.table 2.97 8809
## 8 matrix 1.66 6205
## 9 loops 0.743 5149
## 10 regex 2 4912
## # ... with 7,831 more rows
Nice! It looks like questions with the R tag get a relatively low score, but questions with the loops tag are even lower.
What tags never appear on R questions?
The tags
table includes all Stack Overflow tags, but some have nothing to do withR
. How could you filter for just the tags that never appear on an R question? The tags
and question_tags
tables have been preloaded for you.
# Using a join, filter for tags that are never on an R question
tags %>%
anti_join(question_tags, by = c("id" = "tag_id"))
## # A tibble: 40,459 x 2
## id tag_name
## <dbl> <chr>
## 1 124399 laravel-dusk
## 2 124402 spring-cloud-vault-config
## 3 124404 spring-vault
## 4 124405 apache-bahir
## 5 124407 astc
## 6 124408 simulacrum
## 7 124410 angulartics2
## 8 124411 django-rest-viewsets
## 9 124414 react-native-lightbox
## 10 124417 java-module
## # ... with 40,449 more rows
Great! It looks like there are more than 40,000 tags that have never appeared along R!
Finding gaps between questions and answers
Now we’ll join together questions
with answers
so we can measure the time between questions and answers.
Make sure to explore the tables and columns in the console before starting the exercise. Can you tell how are questions identified in the questions
table? How can you identify which answer corresponds to which question using the answers
table?
answers <- readRDS("_data/answers.rds")
questions %>%
# Inner join questions and answers with proper suffixes
inner_join(answers, by = c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
# Subtract creation_date_question from creation_date_answer to create gap
mutate(gap = as.integer(creation_date_answer - creation_date_question))
## # A tibble: 380,643 x 7
## id creation_date_q~ score_question id_answer creation_date_a~
## <int> <date> <int> <int> <date>
## 1 2.26e7 2014-03-21 1 22560670 2014-03-21
## 2 2.26e7 2014-03-21 2 22558516 2014-03-21
## 3 2.26e7 2014-03-21 2 22558726 2014-03-21
## 4 2.26e7 2014-03-21 2 22558085 2014-03-21
## 5 2.26e7 2014-03-21 2 22606545 2014-03-24
## 6 2.26e7 2014-03-21 2 22610396 2014-03-24
## 7 2.26e7 2014-03-21 2 34374729 2015-12-19
## 8 2.26e7 2014-03-21 2 22559327 2014-03-21
## 9 2.26e7 2014-03-21 2 22560102 2014-03-21
## 10 2.26e7 2014-03-21 2 22560288 2014-03-21
## # ... with 380,633 more rows, and 2 more variables: score_answer <int>,
## # gap <int>
Nice! Now we could use this information to identify how long it takes different questions to get answers.
Joining question and answer counts
We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions
table.
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
count(question_id, sort = TRUE)
answer_counts
## # A tibble: 243,930 x 2
## question_id n
## <int> <int>
## 1 1295955 34
## 2 2547402 30
## 3 1358003 27
## 4 4090169 26
## 5 1535021 25
## 6 1189759 24
## 7 1815606 24
## 8 5963269 23
## 9 17200114 22
## 10 102056 21
## # ... with 243,920 more rows
# Combine the answer_counts and questions tables
questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
# Replace the NAs in the n column
replace_na(list(n = 0))
## # A tibble: 294,735 x 4
## id creation_date score n
## <int> <date> <int> <dbl>
## 1 22557677 2014-03-21 1 1
## 2 22557707 2014-03-21 2 2
## 3 22558084 2014-03-21 2 4
## 4 22558395 2014-03-21 2 3
## 5 22558613 2014-03-21 0 1
## 6 22558677 2014-03-21 2 2
## 7 22558887 2014-03-21 8 1
## 8 22559180 2014-03-21 1 1
## 9 22559312 2014-03-21 0 1
## 10 22559322 2014-03-21 2 5
## # ... with 294,725 more rows
Great! You can use this combined table to see which questions have the most answers, and which questions have no answers.
Joining questions, answers, and tags
Let’s build on the last exercise by adding the tags
table to our previous joins. This will allow us to do a better job of identifying which R topics get the most traction on Stack Overflow. The tables you created in the last exercise have been preloaded for you as answer_counts
and question_answer_counts
.
answer_counts <- answers %>%
count(question_id, sort = TRUE)
question_answer_counts <- questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
replace_na(list(n = 0))
answer_counts <- answers %>%
count(question_id, sort = TRUE)
question_answer_counts <- questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
replace_na(list(n = 0))
question_answer_counts %>%
# Join the question_tags tables
inner_join(question_tags, by = c("id" = "question_id")) %>%
# Join the tags table
inner_join(tags, by = c("tag_id" = "id"))
## # A tibble: 497,153 x 6
## id creation_date score n tag_id tag_name
## <int> <date> <int> <dbl> <dbl> <chr>
## 1 22557677 2014-03-21 1 1 18 regex
## 2 22557677 2014-03-21 1 1 139 string
## 3 22557677 2014-03-21 1 1 16088 time-complexity
## 4 22557677 2014-03-21 1 1 1672 backreference
## 5 22558084 2014-03-21 2 4 6419 time-series
## 6 22558084 2014-03-21 2 4 92764 panel-data
## 7 22558395 2014-03-21 2 3 5569 function
## 8 22558395 2014-03-21 2 3 134 sorting
## 9 22558395 2014-03-21 2 3 9412 vectorization
## 10 22558395 2014-03-21 2 3 18621 operator-precedence
## # ... with 497,143 more rows
Great! Now we have a more holistic view of how questions are answered by each tag.
Average answers by question
The table you created in the last exercise has been preloaded for you as tagged_answers
. You can use this table to determine, on average, how many answers each questions gets.
tagged_answers <- question_answer_counts %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
Some of the important variables from this table include: n
, the number of answers for each question, and tag_name
, the name of each tag associated with each question.
Let’s use some of our favorite dplyr verbs to find out how many answers each question gets on average.
tagged_answers <- question_answer_counts %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
tagged_answers %>%
# Aggregate by tag_name
group_by(tag_name) %>%
# Summarize questions and average_answers
summarize(questions = n(),
average_answers = mean(n)) %>%
# Sort the questions in descending order
arrange(desc(questions))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7,840 x 3
## tag_name questions average_answers
## <chr> <int> <dbl>
## 1 ggplot2 28228 1.15
## 2 dataframe 18874 1.67
## 3 shiny 14219 0.921
## 4 dplyr 14039 1.55
## 5 plot 11315 1.23
## 6 data.table 8809 1.47
## 7 matrix 6205 1.45
## 8 loops 5149 1.39
## 9 regex 4912 1.91
## 10 function 4892 1.30
## # ... with 7,830 more rows
Great! You can see if you post a question about ggplot2, on average you’ll get an answer.
Joining questions and answers with tags
To learn more about the questions
and answers
table, you’ll want to use the question_tags
table to understand the tags associated with each question that was asked, and each answer that was provided. You’ll be able to combine these tables using two inner joins on both the questions
table and the answers
table.
# Inner join the question_tags and tags tables with the questions table
questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
## # A tibble: 497,153 x 5
## id creation_date score tag_id tag_name
## <int> <date> <int> <dbl> <chr>
## 1 22557677 2014-03-21 1 18 regex
## 2 22557677 2014-03-21 1 139 string
## 3 22557677 2014-03-21 1 16088 time-complexity
## 4 22557677 2014-03-21 1 1672 backreference
## 5 22558084 2014-03-21 2 6419 time-series
## 6 22558084 2014-03-21 2 92764 panel-data
## 7 22558395 2014-03-21 2 5569 function
## 8 22558395 2014-03-21 2 134 sorting
## 9 22558395 2014-03-21 2 9412 vectorization
## 10 22558395 2014-03-21 2 18621 operator-precedence
## # ... with 497,143 more rows
# Inner join the question_tags and tags tables with the answers table
answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
## # A tibble: 625,845 x 6
## id creation_date question_id score tag_id tag_name
## <int> <date> <int> <int> <dbl> <chr>
## 1 39143935 2016-08-25 39142481 0 4240 average
## 2 39143935 2016-08-25 39142481 0 5571 summary
## 3 39144014 2016-08-25 39024390 0 85748 shiny
## 4 39144014 2016-08-25 39024390 0 83308 r-markdown
## 5 39144014 2016-08-25 39024390 0 116736 htmlwidgets
## 6 39144252 2016-08-25 39096741 6 67746 rstudio
## 7 39144375 2016-08-25 39143885 5 105113 data.table
## 8 39144430 2016-08-25 39144077 0 276 variables
## 9 39144625 2016-08-25 39142728 1 46457 dataframe
## 10 39144625 2016-08-25 39142728 1 9047 subset
## # ... with 625,835 more rows
Great! Now we will be able to combine each of these individual tables into a single cohesive table to have a better understanding of the information we have about the questions, answers, and associated tags.
Binding and counting posts with tags
The tables you created in the previous exercise have been preloaded as questions_with_tags
and answers_with_tags
. First, you’ll want to combine these tables into a single table called posts_with_tags
. Once the information is consolidated into a single table, you can add more information by creating a date variable using the lubridate
package, which has been preloaded for you.
questions_with_tags <- questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
answers_with_tags <- answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
questions_with_tags <- questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
answers_with_tags <- answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
answers_with_tags %>% mutate(type = "answer"))
# Add a year column, then aggregate by type, year, and tag_name
posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)
## # A tibble: 58,299 x 4
## type year tag_name n
## <chr> <dbl> <chr> <int>
## 1 answer 2008 bayesian 1
## 2 answer 2008 dataframe 3
## 3 answer 2008 dirichlet 1
## 4 answer 2008 eof 1
## 5 answer 2008 file 1
## 6 answer 2008 file-io 1
## 7 answer 2008 function 7
## 8 answer 2008 global-variables 7
## 9 answer 2008 math 2
## 10 answer 2008 mathematical-optimization 1
## # ... with 58,289 more rows
Great work! Now you have a single table that displays the type
, year
, tag_name
, and count as n
.
Visualizing questions and answers in tags
In the last exercise, you modified the posts_with_tags
table to add a year
column, and aggregated by type
, year
, and tag_name
. The modified table has been preloaded for you as by_type_year_tag
, and has one observation for each type (question/answer), year, and tag. Let’s create a plot to examine the information that the table contains about questions and answers for the dplyr
and ggplot2
tags. The ggplot2
package has been preloaded for you.
by_type_year_tag <- posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)
by_type_year_tag <- posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)
# Filter for the dplyr and ggplot2 tag names
by_type_year_tag_filtered <- by_type_year_tag %>%
filter(tag_name %in% c("dplyr", "ggplot2"))
# Create a line plot faceted by the tag name
ggplot(by_type_year_tag_filtered, aes(year, n, color = type)) +
geom_line() +
facet_wrap(~ tag_name)
Neat! Notice answers on dplyr questions are growing faster than dplyr questions themselves; meaning the average dplyr question has more answers than the average ggplot2 question.