Filtering rows
The vote column in the dataset has a number that represents that country’s vote: a 1 represents “yes”, a 2 “abstain”, a 3 “no”, and higher codes mark countries that were “Not present” or “Not a member”.
One step of data cleaning is removing observations (rows) that you’re not interested in. In this case, you want to remove the “Not present” and “Not a member” rows.
# Load the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
votes <- readRDS("_data/votes.rds")
# Print the votes dataset
votes
## # A tibble: 508,929 x 4
## rcid session vote ccode
## <dbl> <dbl> <dbl> <int>
## 1 46 2 1 2
## 2 46 2 1 20
## 3 46 2 9 31
## 4 46 2 1 40
## 5 46 2 1 41
## 6 46 2 1 42
## 7 46 2 9 51
## 8 46 2 9 52
## 9 46 2 9 53
## 10 46 2 9 54
## # ... with 508,919 more rows
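Before filtering, it can be helpful to tally how many rows each vote code has; a minimal sketch using dplyr’s count():
# Count the rows for each vote code (a quick sanity check before filtering)
votes %>%
  count(vote)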
# Filter for votes that are "yes", "abstain", or "no"
votes %>%
filter(vote <= 3)
## # A tibble: 353,547 x 4
## rcid session vote ccode
## <dbl> <dbl> <dbl> <int>
## 1 46 2 1 2
## 2 46 2 1 20
## 3 46 2 1 40
## 4 46 2 1 41
## 5 46 2 1 42
## 6 46 2 1 70
## 7 46 2 1 90
## 8 46 2 1 91
## 9 46 2 1 92
## 10 46 2 1 93
## # ... with 353,537 more rows
Adding a year column
The next step of data cleaning is manipulating your variables (columns) to make them more informative.
In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.
# Add another %>% step to add a year column
votes %>%
filter(vote <= 3) %>%
mutate(year = session + 1945)
## # A tibble: 353,547 x 5
## rcid session vote ccode year
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 46 2 1 2 1947
## 2 46 2 1 20 1947
## 3 46 2 1 40 1947
## 4 46 2 1 41 1947
## 5 46 2 1 42 1947
## 6 46 2 1 70 1947
## 7 46 2 1 90 1947
## 8 46 2 1 91 1947
## 9 46 2 1 92 1947
## 10 46 2 1 93 1947
## # ... with 353,537 more rows
Adding a country column
The country codes in the ccode column are what’s called Correlates of War codes. This isn’t ideal for an analysis, since you’d like to work with recognizable country names.
You can use the countrycode package to translate. For example:
library(countrycode)
# Translate the country code 2
countrycode(2, "cown", "country.name")
## [1] "United States"
# Translate multiple country codes
countrycode(c(2, 20, 40), "cown", "country.name")
## [1] "United States" "Canada" "Cuba"
# Load the countrycode package
library(countrycode)
# Convert country code 100
countrycode(100, "cown", "country.name")
## [1] "Colombia"
# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
filter(vote <= 3) %>%
mutate(year = session + 1945,
country = countrycode(ccode, "cown", "country.name"))
## Warning: Problem with `mutate()` input `country`.
## i Some values were not matched unambiguously: 260
##
## i Input `country` is `countrycode(ccode, "cown", "country.name")`.
## Warning in countrycode(ccode, "cown", "country.name"): Some values were not matched unambiguously: 260
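The warning means that one Correlates of War code (260) has no unambiguous match and ends up as NA in the country column. A quick way to see how many rows are affected, using only dplyr verbs:
# Inspect the rows whose country code could not be translated
votes_processed %>%
  filter(is.na(country)) %>%
  count(ccode)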
Summarizing the full dataset
In this analysis, you’re going to focus on “% of votes that are yes” as a metric for the “agreeableness” of countries.
You’ll start by finding this summary for the entire dataset: the fraction of all votes in their history that were “yes”. Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of “yes” votes.
# Print votes_processed
votes_processed
## # A tibble: 353,547 x 6
## rcid session vote ccode year country
## <dbl> <dbl> <dbl> <int> <dbl> <chr>
## 1 46 2 1 2 1947 United States
## 2 46 2 1 20 1947 Canada
## 3 46 2 1 40 1947 Cuba
## 4 46 2 1 41 1947 Haiti
## 5 46 2 1 42 1947 Dominican Republic
## 6 46 2 1 70 1947 Mexico
## 7 46 2 1 90 1947 Guatemala
## 8 46 2 1 91 1947 Honduras
## 9 46 2 1 92 1947 El Salvador
## 10 46 2 1 93 1947 Nicaragua
## # ... with 353,537 more rows
# Find total and fraction of "yes" votes
votes_processed %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## # A tibble: 1 x 2
## total percent_yes
## <int> <dbl>
## 1 353547 0.800
Summarizing by year
The summarize() function is especially useful because it can be used within groups.
For example, you might like to know how much the average “agreeableness” of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.
# Change this code to summarize by year
votes_processed %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 34 x 3
## year total percent_yes
## <dbl> <int> <dbl>
## 1 1947 2039 0.569
## 2 1949 3469 0.438
## 3 1951 1434 0.585
## 4 1953 1537 0.632
## 5 1955 2169 0.695
## 6 1957 2708 0.609
## 7 1959 4326 0.588
## 8 1961 7482 0.573
## 9 1963 3308 0.729
## 10 1965 4382 0.708
## # ... with 24 more rows
Nice one! The group_by() function must go before your call to summarize() when you’re trying to perform your summary within groups.
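As an aside, if you’d rather not see the ungrouping message, you can set the .groups argument yourself; a minimal sketch, assuming dplyr 1.0 or later:
# Drop the grouping explicitly after summarizing
votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1),
            .groups = "drop")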
Summarizing by country
In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.
# Summarize by country: by_country
by_country <- votes_processed %>%
group_by(country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
Sorting by percentage of “yes” votes
Now that you’ve summarized the dataset by country, you can start examining it and answering interesting questions.
For example, you might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.
# You have the votes summarized by country
by_country <- votes_processed %>%
group_by(country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
# Print the by_country dataset
by_country
## # A tibble: 200 x 3
## country total percent_yes
## <chr> <int> <dbl>
## 1 Afghanistan 2373 0.859
## 2 Albania 1695 0.717
## 3 Algeria 2213 0.899
## 4 Andorra 719 0.638
## 5 Angola 1431 0.924
## 6 Antigua & Barbuda 1302 0.912
## 7 Argentina 2553 0.768
## 8 Armenia 758 0.747
## 9 Australia 2575 0.557
## 10 Austria 2389 0.622
## # ... with 190 more rows
# Sort in ascending order of percent_yes
by_country %>%
arrange(percent_yes)
## # A tibble: 200 x 3
## country total percent_yes
## <chr> <int> <dbl>
## 1 Zanzibar 2 0
## 2 United States 2568 0.269
## 3 Palau 369 0.339
## 4 Israel 2380 0.341
## 5 <NA> 1075 0.397
## 6 United Kingdom 2558 0.417
## 7 France 2527 0.427
## 8 Micronesia (Federated States of) 724 0.442
## 9 Marshall Islands 757 0.491
## 10 Belgium 2568 0.492
## # ... with 190 more rows
# Now sort in descending order
by_country %>%
arrange(desc(percent_yes))
## # A tibble: 200 x 3
## country total percent_yes
## <chr> <int> <dbl>
## 1 São Tomé & Príncipe 1091 0.976
## 2 Seychelles 881 0.975
## 3 Djibouti 1598 0.961
## 4 Guinea-Bissau 1538 0.960
## 5 Timor-Leste 326 0.957
## 6 Mauritius 1831 0.950
## 7 Zimbabwe 1361 0.949
## 8 Comoros 1133 0.947
## 9 United Arab Emirates 1934 0.947
## 10 Mozambique 1701 0.947
## # ... with 190 more rows
Filtering summarized output
In the last exercise, you may have noticed that the country that voted least frequently, Zanzibar, had only 2 votes in the entire dataset. You certainly can’t make any substantial conclusions based on that data!
Typically, when you find that a few of your observations have very little data while others have plenty, you set a threshold and filter out the observations below it.
# Filter out countries with fewer than 100 votes
by_country %>%
arrange(percent_yes) %>%
filter(total >= 100)
## # A tibble: 197 x 3
## country total percent_yes
## <chr> <int> <dbl>
## 1 United States 2568 0.269
## 2 Palau 369 0.339
## 3 Israel 2380 0.341
## 4 <NA> 1075 0.397
## 5 United Kingdom 2558 0.417
## 6 France 2527 0.427
## 7 Micronesia (Federated States of) 724 0.442
## 8 Marshall Islands 757 0.491
## 9 Belgium 2568 0.492
## 10 Canada 2576 0.508
## # ... with 187 more rows
Plotting a line over time
In the last chapter, you learned how to summarize() the votes dataset by year, particularly the percentage of votes in each year that were “yes”.
You’ll now use the ggplot2 package to turn your results into a visualization of the percentage of “yes” votes over time.
# Define by_year
by_year <- votes_processed %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
# Load the ggplot2 package
library(ggplot2)
# Create line plot
ggplot(by_year, aes(year, percent_yes)) +
geom_line()
Other ggplot2 layers
A line plot is one way to display this data. You could also choose to display it as a scatter plot, with each year represented as a single point. This requires changing the layer (i.e. swapping geom_line() for geom_point()).
You can also add additional layers to your graph, such as a smoothing curve with geom_smooth().
# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
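If you’d rather see a straight trend line than the default loess curve, you can pass a method to geom_smooth(); a small sketch:
# Use a linear model for the smoothing layer instead of loess
ggplot(by_year, aes(year, percent_yes)) +
  geom_point() +
  geom_smooth(method = "lm")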
Summarizing by year and country
You’re more interested in trends of voting within specific countries than you are in the overall trend. So instead of summarizing just by year, summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.
# Group by year and country: by_year_country
by_year_country <- votes_processed %>%
group_by(year, country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
Awesome! Let’s make some plots using this new dataset in the next exercise.
Plotting just the UK over time
Now that you have the percentage of time that each country voted “yes” within each year, you can plot the trend for a particular country. In this case, you’ll look at the trend for just the United Kingdom.
This will involve using filter() on your data before giving it to ggplot2.
# Start with by_year_country dataset
by_year_country <- votes_processed %>%
group_by(year, country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
# Print by_year_country
by_year_country
## # A tibble: 4,744 x 4
## # Groups: year [34]
## year country total percent_yes
## <dbl> <chr> <int> <dbl>
## 1 1947 Afghanistan 34 0.382
## 2 1947 Argentina 38 0.579
## 3 1947 Australia 38 0.553
## 4 1947 Belarus 38 0.5
## 5 1947 Belgium 38 0.605
## 6 1947 Bolivia 37 0.595
## 7 1947 Brazil 38 0.658
## 8 1947 Canada 38 0.605
## 9 1947 Chile 38 0.658
## 10 1947 Colombia 35 0.543
## # ... with 4,734 more rows
# Create a filtered version: UK_by_year
UK_by_year <- by_year_country %>%
filter(country == "United Kingdom")
# Line plot of percent_yes over time for UK only
ggplot(UK_by_year, aes(year, percent_yes)) +
geom_line()
Plotting multiple countries
Plotting just one country at a time is interesting, but you really want to compare trends between countries. For example, suppose you want to compare voting trends for the United States, the UK, France, and India.
You’ll have to filter for these four countries and use an additional aesthetic, beyond the x- and y-axes, to distinguish them in the resulting visualization. Here, you’ll use the color aesthetic to represent the different countries.
# Vector of four countries to examine
countries <- c("United States", "United Kingdom",
"France", "India")
# Filter by_year_country: filtered_4_countries
filtered_4_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(year, percent_yes, color = country)) +
geom_line()
Faceting the time series
Now you’ll take a look at six countries. While in the previous exercise you used color to represent distinct countries, this gets a little too crowded with six.
Instead, you will facet, giving each country its own sub-plot. To do so, you add a facet_wrap() step after all of your layers.
# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country)
Faceting with free y-axis
In the previous plot, all six graphs had the same axis limits. This made the changes over time hard to examine for plots with relatively little change.
Instead, you may want to let the plot choose a different y-axis for each facet.
# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country, scales = "free_y")
Choose your own countries
The purpose of an exploratory data analysis is to ask questions and answer them with data. Now it’s your turn to ask the questions.
You’ll choose some countries whose history you are interested in and add them to the graph. If you want to look up the full list of countries, enter by_country$country in the console.
# Add three more countries to this list
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India",
"Russia", "China", "North Korea")
# Filtered by_year_country: filtered_countries
filtered_countries <- by_year_country %>%
filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country, scales = "free_y")
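If one of the countries you added shows up empty or missing, its name probably doesn’t match the spelling used in the dataset. One quick check, reusing the countries vector defined above:
# Any names printed here are not present in by_country$country
setdiff(countries, by_country$country)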
Linear regression on the United States
A linear regression is a model that lets us examine how one variable changes with respect to another by fitting a best-fit line. It is done with the lm() function in R.
Here, you’ll fit a linear regression to just the percentage of “yes” votes from the United States.
# Percentage of yes votes from the US by year: US_by_year
US_by_year <- by_year_country %>%
filter(country == "United States")
# Print the US_by_year data
US_by_year
## # A tibble: 34 x 4
## # Groups: year [34]
## year country total percent_yes
## <dbl> <chr> <int> <dbl>
## 1 1947 United States 38 0.711
## 2 1949 United States 64 0.281
## 3 1951 United States 25 0.4
## 4 1953 United States 26 0.5
## 5 1955 United States 37 0.622
## 6 1957 United States 34 0.647
## 7 1959 United States 54 0.426
## 8 1961 United States 75 0.507
## 9 1963 United States 32 0.5
## 10 1965 United States 41 0.366
## # ... with 24 more rows
# Perform a linear regression of percent_yes by year: US_fit
US_fit <- lm(percent_yes ~ year, US_by_year)
# Perform summary() on the US_fit object
summary(US_fit)
##
## Call:
## lm(formula = percent_yes ~ year, data = US_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.222491 -0.080635 -0.008661 0.081948 0.194307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.6641455 1.8379743 6.890 8.48e-08 ***
## year -0.0062393 0.0009282 -6.722 1.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1062 on 32 degrees of freedom
## Multiple R-squared: 0.5854, Adjusted R-squared: 0.5724
## F-statistic: 45.18 on 1 and 32 DF, p-value: 1.367e-07
Tidying a linear regression model
In the last section, you fit a linear model. Now, you’ll use the tidy() function from the broom package to turn that model into a tidy data frame.
# Load the broom package
library(broom)
# Call the tidy() function on the US_fit object
tidy(US_fit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.7 1.84 6.89 0.0000000848
## 2 year -0.00624 0.000928 -6.72 0.000000137
Combining models for multiple countries
One important advantage of converting models into tidy data frames is that they can be combined.
In an earlier section, you fit a linear model to the percentage of “yes” votes for each year in the United States. Now you’ll fit the same model for the United Kingdom and combine the results from both countries.
# Linear regression of percent_yes by year for US
US_by_year <- by_year_country %>%
filter(country == "United States")
US_fit <- lm(percent_yes ~ year, US_by_year)
# Fit model for the United Kingdom
UK_by_year <- by_year_country %>%
filter(country == "United Kingdom")
UK_fit <- lm(percent_yes ~ year, UK_by_year)
# Create US_tidied and UK_tidied
US_tidied <- tidy(US_fit)
UK_tidied <- tidy(UK_fit)
# Combine the two tidied models
bind_rows(US_tidied, UK_tidied)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.7 1.84 6.89 0.0000000848
## 2 year -0.00624 0.000928 -6.72 0.000000137
## 3 (Intercept) -3.27 1.96 -1.67 0.105
## 4 year 0.00187 0.000989 1.89 0.0677
Awesome! We can easily compare the two models now.
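One hedged refinement: bind_rows() doesn’t record which rows came from which model, so you may want to label each tidied data frame before combining them. A sketch:
# Label each model so the source country stays visible after binding
bind_rows(
  US_tidied %>% mutate(country = "United States"),
  UK_tidied %>% mutate(country = "United Kingdom")
)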
Nesting a data frame
Right now, the by_year_country data frame has one row per country-year pair. So that you can model each country individually, you’re going to “nest” all columns besides country, which will result in a data frame with one row per country. The data for each individual country will then be stored in a list column called data.
# Load the tidyr package
library(tidyr)
# Nest all columns besides country
by_year_country %>%
nest(-country)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## # A tibble: 200 x 2
## country data
## <chr> <list>
## 1 Afghanistan <tibble [34 x 3]>
## 2 Argentina <tibble [34 x 3]>
## 3 Australia <tibble [34 x 3]>
## 4 Belarus <tibble [34 x 3]>
## 5 Belgium <tibble [34 x 3]>
## 6 Bolivia <tibble [34 x 3]>
## 7 Brazil <tibble [34 x 3]>
## 8 Canada <tibble [34 x 3]>
## 9 Chile <tibble [34 x 3]>
## 10 Colombia <tibble [34 x 3]>
## # ... with 190 more rows
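The warning above comes from the older way of calling nest(); in more recent tidyr versions you can name the list column explicitly, as the message itself suggests. A sketch, assuming tidyr 1.0 or later:
# Equivalent nesting with the list column named explicitly
by_year_country %>%
  nest(data = c(year, total, percent_yes))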
List columns
This “nested” data has an interesting structure. The second column, data, is a list: a type of R object that hasn’t yet come up in this course, which allows complicated objects to be stored within each row. Here, each item of the data column is itself a data frame.
# A tibble: 200 × 2
country data
<chr> <list>
1 Afghanistan <tibble [34 × 3]>
2 Argentina <tibble [34 × 3]>
3 Australia <tibble [34 × 3]>
4 Belarus <tibble [34 × 3]>
5 Belgium <tibble [34 × 3]>
6 Bolivia, Plurinational State of <tibble [34 × 3]>
7 Brazil <tibble [34 × 3]>
8 Canada <tibble [34 × 3]>
9 Chile <tibble [34 × 3]>
10 Colombia <tibble [34 × 3]>
You can use nested$data to access this list column and double brackets to access a particular element. For example, nested$data[[1]] would give you the data frame with Afghanistan’s voting history (the percent_yes per year), since Afghanistan is the first row of the table.
# All columns besides country are nested
nested <- by_year_country %>%
nest(-country)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
# Print the nested data for Brazil
nested$data[[7]]
## # A tibble: 34 x 3
## # Groups: year [34]
## year total percent_yes
## <dbl> <int> <dbl>
## 1 1947 38 0.658
## 2 1949 64 0.469
## 3 1951 25 0.64
## 4 1953 26 0.731
## 5 1955 37 0.730
## 6 1957 34 0.735
## 7 1959 54 0.537
## 8 1961 76 0.553
## 9 1963 32 0.781
## 10 1965 41 0.610
## # ... with 24 more rows
Unnesting
The opposite of the nest() operation is the unnest() operation. This takes each of the data frames in the list column and brings those rows back to the main data frame.
In this exercise, you are just undoing the nest() operation. In the next section, you’ll learn how to fit a model in between these nesting and unnesting steps that makes this process useful.
# All columns besides country are nested
nested <- by_year_country %>%
nest(-country)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
# Unnest the data column to return it to its original form
nested %>%
unnest(data)
## # A tibble: 4,744 x 4
## country year total percent_yes
## <chr> <dbl> <int> <dbl>
## 1 Afghanistan 1947 34 0.382
## 2 Afghanistan 1949 51 0.608
## 3 Afghanistan 1951 25 0.76
## 4 Afghanistan 1953 26 0.769
## 5 Afghanistan 1955 37 0.730
## 6 Afghanistan 1957 34 0.529
## 7 Afghanistan 1959 54 0.611
## 8 Afghanistan 1961 76 0.605
## 9 Afghanistan 1963 32 0.781
## 10 Afghanistan 1965 40 0.85
## # ... with 4,734 more rows
map() applies an operation to each item in a list
Performing linear regression on each nested dataset
Now that you’ve divided the data for each country into a separate dataset in the data column, you need to fit a linear model to each of these datasets.
The map() function from purrr works by applying a formula to each item in a list, where . represents the individual item. For example, you could add one to each of a list of numbers:
map(numbers, ~ 1 + .)
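For instance, a minimal self-contained sketch (the numbers list here is purely illustrative):
# Apply the formula ~ 1 + . to each element of the list
library(purrr)
numbers <- list(1, 2, 3)
map(numbers, ~ 1 + .)   # returns a list containing 2, 3, and 4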
This means that to fit a model to each dataset, you can do:
map(data, ~ lm(percent_yes ~ year, data = .))
where . represents each individual item from the data column in by_year_country. Recall that each item in the data column is a dataset that pertains to a specific country.
# Load tidyr and purrr
library(tidyr)
library(purrr)
# Perform a linear regression on each item in the data column
by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)))
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## # A tibble: 200 x 3
## country data model
## <chr> <list> <list>
## 1 Afghanistan <tibble [34 x 3]> <lm>
## 2 Argentina <tibble [34 x 3]> <lm>
## 3 Australia <tibble [34 x 3]> <lm>
## 4 Belarus <tibble [34 x 3]> <lm>
## 5 Belgium <tibble [34 x 3]> <lm>
## 6 Bolivia <tibble [34 x 3]> <lm>
## 7 Brazil <tibble [34 x 3]> <lm>
## 8 Canada <tibble [34 x 3]> <lm>
## 9 Chile <tibble [34 x 3]> <lm>
## 10 Colombia <tibble [34 x 3]> <lm>
## # ... with 190 more rows
Tidy each linear regression model
You’ve now performed a linear regression on each nested dataset and have a linear model stored in the list column model. But you can’t recombine the models until you’ve tidied each into a table of coefficients. To do that, you’ll need to use map() one more time, along with the tidy() function from the broom package.
Recall that you can simply give a function to map() (e.g. map(models, tidy)) in order to apply that function to each item of a list.
# Load the broom package
library(broom)
# Add another mutate that applies tidy() to each model
by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy))
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## # A tibble: 200 x 4
## country data model tidied
## <chr> <list> <list> <list>
## 1 Afghanistan <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 2 Argentina <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 3 Australia <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 4 Belarus <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 5 Belgium <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 6 Bolivia <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 7 Brazil <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 8 Canada <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 9 Chile <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## 10 Colombia <tibble [34 x 3]> <lm> <tibble [2 x 5]>
## # ... with 190 more rows
Unnesting a data frame
You now have a tidied version of each model stored in the tidied column. You want to combine all of those into a large data frame, similar to how you combined the US and UK tidied models earlier. Recall that the unnest() function from tidyr achieves this.
# Add one more step that unnests the tidied column
country_coefficients <- by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
# Print the resulting country_coefficients variable
country_coefficients
## # A tibble: 400 x 8
## country data model term estimate std.error statistic p.value
## <chr> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis~ <tibble [3~ <lm> (Interc~ -1.11e+1 1.47 -7.52 1.44e-8
## 2 Afghanis~ <tibble [3~ <lm> year 6.01e-3 0.000743 8.09 3.06e-9
## 3 Argentina <tibble [3~ <lm> (Interc~ -9.46e+0 2.10 -4.50 8.32e-5
## 4 Argentina <tibble [3~ <lm> year 5.15e-3 0.00106 4.85 3.05e-5
## 5 Australia <tibble [3~ <lm> (Interc~ -4.55e+0 2.15 -2.12 4.22e-2
## 6 Australia <tibble [3~ <lm> year 2.57e-3 0.00108 2.37 2.42e-2
## 7 Belarus <tibble [3~ <lm> (Interc~ -7.00e+0 1.50 -4.66 5.33e-5
## 8 Belarus <tibble [3~ <lm> year 3.91e-3 0.000759 5.15 1.28e-5
## 9 Belgium <tibble [3~ <lm> (Interc~ -5.85e+0 1.52 -3.86 5.22e-4
## 10 Belgium <tibble [3~ <lm> year 3.20e-3 0.000765 4.19 2.07e-4
## # ... with 390 more rows
Filtering model terms
You currently have both the intercept and slope terms for each by-country model. You’re probably more interested in how each country’s voting is changing over time, so you want to focus on the slope terms.
# Print the country_coefficients dataset
country_coefficients
## # A tibble: 400 x 8
## country data model term estimate std.error statistic p.value
## <chr> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis~ <tibble [3~ <lm> (Interc~ -1.11e+1 1.47 -7.52 1.44e-8
## 2 Afghanis~ <tibble [3~ <lm> year 6.01e-3 0.000743 8.09 3.06e-9
## 3 Argentina <tibble [3~ <lm> (Interc~ -9.46e+0 2.10 -4.50 8.32e-5
## 4 Argentina <tibble [3~ <lm> year 5.15e-3 0.00106 4.85 3.05e-5
## 5 Australia <tibble [3~ <lm> (Interc~ -4.55e+0 2.15 -2.12 4.22e-2
## 6 Australia <tibble [3~ <lm> year 2.57e-3 0.00108 2.37 2.42e-2
## 7 Belarus <tibble [3~ <lm> (Interc~ -7.00e+0 1.50 -4.66 5.33e-5
## 8 Belarus <tibble [3~ <lm> year 3.91e-3 0.000759 5.15 1.28e-5
## 9 Belgium <tibble [3~ <lm> (Interc~ -5.85e+0 1.52 -3.86 5.22e-4
## 10 Belgium <tibble [3~ <lm> year 3.20e-3 0.000765 4.19 2.07e-4
## # ... with 390 more rows
# Filter for only the slope terms
country_coefficients %>%
filter(term == "year")
## # A tibble: 200 x 8
## country data model term estimate std.error statistic p.value
## <chr> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanist~ <tibble [34 ~ <lm> year 0.00601 0.000743 8.09 3.06e-9
## 2 Argentina <tibble [34 ~ <lm> year 0.00515 0.00106 4.85 3.05e-5
## 3 Australia <tibble [34 ~ <lm> year 0.00257 0.00108 2.37 2.42e-2
## 4 Belarus <tibble [34 ~ <lm> year 0.00391 0.000759 5.15 1.28e-5
## 5 Belgium <tibble [34 ~ <lm> year 0.00320 0.000765 4.19 2.07e-4
## 6 Bolivia <tibble [34 ~ <lm> year 0.00580 0.000966 6.01 1.06e-6
## 7 Brazil <tibble [34 ~ <lm> year 0.00611 0.000817 7.48 1.64e-8
## 8 Canada <tibble [34 ~ <lm> year 0.00152 0.000955 1.59 1.22e-1
## 9 Chile <tibble [34 ~ <lm> year 0.00678 0.000822 8.24 2.05e-9
## 10 Colombia <tibble [34 ~ <lm> year 0.00616 0.000965 6.38 3.58e-7
## # ... with 190 more rows
Filtering for significant countries
Not all slopes are significant, and you can use the p-value to assess which are and which are not.
However, when you have many p-values, like one for each country, you run into the problem of multiple hypothesis testing, where you have to set a stricter threshold. The p.adjust() function is a simple way to correct for this: calling p.adjust(p.value) on a vector of p-values returns a set of adjusted p-values that accounts for the number of tests.
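To get a feel for what the adjustment does, here is a minimal sketch on a made-up vector of p-values (the numbers are illustrative only):
# p.adjust() applies the Holm correction by default,
# scaling each p-value up to account for the number of tests
p.adjust(c(0.01, 0.04, 0.20))   # returns 0.03, 0.08, 0.20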
Here you’ll add two steps to process the slope_terms dataset: use mutate() to create the new, adjusted p-value column, and filter() to keep only the terms below a .05 threshold.
# Filter for only the slope terms
slope_terms <- country_coefficients %>%
filter(term == "year")
# Add p.adjusted column, then filter
slope_terms %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
## # A tibble: 61 x 9
## country data model term estimate std.error statistic p.value p.adjusted
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani~ <tibbl~ <lm> year 0.00601 0.000743 8.09 3.06e-9 5.95e-7
## 2 Argenti~ <tibbl~ <lm> year 0.00515 0.00106 4.85 3.05e-5 4.81e-3
## 3 Belarus <tibbl~ <lm> year 0.00391 0.000759 5.15 1.28e-5 2.08e-3
## 4 Belgium <tibbl~ <lm> year 0.00320 0.000765 4.19 2.07e-4 3.01e-2
## 5 Bolivia <tibbl~ <lm> year 0.00580 0.000966 6.01 1.06e-6 1.88e-4
## 6 Brazil <tibbl~ <lm> year 0.00611 0.000817 7.48 1.64e-8 3.12e-6
## 7 Chile <tibbl~ <lm> year 0.00678 0.000822 8.24 2.05e-9 3.99e-7
## 8 Colombia <tibbl~ <lm> year 0.00616 0.000965 6.38 3.58e-7 6.56e-5
## 9 Costa R~ <tibbl~ <lm> year 0.00654 0.000812 8.05 3.39e-9 6.54e-7
## 10 Cuba <tibbl~ <lm> year 0.00461 0.000721 6.40 3.43e-7 6.31e-5
## # ... with 51 more rows
Great work! Notice that there are now only 61 countries with significant trends.
Sorting by slope
Now that you’ve filtered for countries where the trend is probably not due to chance, you may be interested in countries whose percentage of “yes” votes is changing most quickly over time. Thus, you want to find the countries with the highest and lowest slopes: that is, the largest and smallest values in the estimate column.
# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
# Sort for the countries increasing most quickly
filtered_countries %>%
arrange(desc(estimate))
## # A tibble: 61 x 9
## country data model term estimate std.error statistic p.value p.adjusted
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South Af~ <tibb~ <lm> year 0.0119 0.00140 8.47 1.60e- 8 3.05e-6
## 2 Kazakhst~ <tibb~ <lm> year 0.0110 0.00195 5.62 3.24e- 4 4.51e-2
## 3 Yemen Ar~ <tibb~ <lm> year 0.0109 0.00159 6.84 1.20e- 6 2.11e-4
## 4 Kyrgyzst~ <tibb~ <lm> year 0.00973 0.000988 9.84 2.38e- 5 3.78e-3
## 5 Malawi <tibb~ <lm> year 0.00908 0.00181 5.02 4.48e- 5 7.03e-3
## 6 Dominica~ <tibb~ <lm> year 0.00806 0.000914 8.81 5.96e-10 1.17e-7
## 7 Portugal <tibb~ <lm> year 0.00802 0.00171 4.68 7.13e- 5 1.10e-2
## 8 Honduras <tibb~ <lm> year 0.00772 0.000921 8.38 1.43e- 9 2.81e-7
## 9 Peru <tibb~ <lm> year 0.00730 0.000976 7.48 1.65e- 8 3.12e-6
## 10 Nicaragua <tibb~ <lm> year 0.00708 0.00107 6.60 1.92e- 7 3.55e-5
## # ... with 51 more rows
# Sort for the countries decreasing most quickly
filtered_countries %>%
arrange(estimate)
## # A tibble: 61 x 9
## country data model term estimate std.error statistic p.value p.adjusted
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South Ko~ <tibbl~ <lm> year -0.00921 0.00155 -5.96 1.39e-4 0.0209
## 2 Israel <tibbl~ <lm> year -0.00685 0.00117 -5.85 1.89e-6 0.000331
## 3 United S~ <tibbl~ <lm> year -0.00624 0.000928 -6.72 1.37e-7 0.0000254
## 4 Belgium <tibbl~ <lm> year 0.00320 0.000765 4.19 2.07e-4 0.0301
## 5 Guinea <tibbl~ <lm> year 0.00362 0.000833 4.35 1.87e-4 0.0275
## 6 Morocco <tibbl~ <lm> year 0.00380 0.000860 4.42 1.46e-4 0.0218
## 7 Belarus <tibbl~ <lm> year 0.00391 0.000759 5.15 1.28e-5 0.00208
## 8 Iran <tibbl~ <lm> year 0.00391 0.000856 4.57 6.91e-5 0.0107
## 9 Congo - ~ <tibbl~ <lm> year 0.00397 0.000922 4.30 2.27e-4 0.0326
## 10 Sudan <tibbl~ <lm> year 0.00399 0.000961 4.15 2.98e-4 0.0420
## # ... with 51 more rows
Joining datasets with inner_join
In the first chapter, you created the votes_processed dataset, containing information about each country’s votes. You’ll now combine that with the new descriptions dataset, which includes topic information about each resolution, so that you can analyze votes within particular topics.
To do this, you’ll make use of the inner_join() function from dplyr.
# Load dplyr package
library(dplyr)
descriptions <- readRDS("_data/descriptions.rds")
# Print the votes_processed dataset
votes_processed
## # A tibble: 353,547 x 6
## rcid session vote ccode year country
## <dbl> <dbl> <dbl> <int> <dbl> <chr>
## 1 46 2 1 2 1947 United States
## 2 46 2 1 20 1947 Canada
## 3 46 2 1 40 1947 Cuba
## 4 46 2 1 41 1947 Haiti
## 5 46 2 1 42 1947 Dominican Republic
## 6 46 2 1 70 1947 Mexico
## 7 46 2 1 90 1947 Guatemala
## 8 46 2 1 91 1947 Honduras
## 9 46 2 1 92 1947 El Salvador
## 10 46 2 1 93 1947 Nicaragua
## # ... with 353,537 more rows
# Print the descriptions dataset
descriptions
## # A tibble: 2,589 x 10
## rcid session date unres me nu di hr co ec
## <dbl> <dbl> <dttm> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 46 2 1947-09-04 00:00:00 R/2/299 0 0 0 0 0 0
## 2 47 2 1947-10-05 00:00:00 R/2/355 0 0 0 1 0 0
## 3 48 2 1947-10-06 00:00:00 R/2/461 0 0 0 0 0 0
## 4 49 2 1947-10-06 00:00:00 R/2/463 0 0 0 0 0 0
## 5 50 2 1947-10-06 00:00:00 R/2/465 0 0 0 0 0 0
## 6 51 2 1947-10-02 00:00:00 R/2/561 0 0 0 0 1 0
## 7 52 2 1947-11-06 00:00:00 R/2/650 0 0 0 0 1 0
## 8 53 2 1947-11-06 00:00:00 R/2/651 0 0 0 0 1 0
## 9 54 2 1947-11-06 00:00:00 R/2/651 0 0 0 0 1 0
## 10 55 2 1947-11-06 00:00:00 R/2/667 0 0 0 0 1 0
## # ... with 2,579 more rows
# Join them together based on the "rcid" and "session" columns
votes_joined <- votes_processed %>%
inner_join(descriptions, by = c("rcid", "session"))
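A quick hedged check after the join: comparing row counts tells you whether any votes were dropped for lack of a matching description.
# If these two counts match, every vote found a description in the join
nrow(votes_processed)
nrow(votes_joined)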
Filtering the joined dataset
There are six columns in the descriptions dataset (and therefore in the new joined dataset) that describe the topic of a resolution: me, nu, di, hr, co, and ec. Each contains a 1 if the resolution is related to that topic and a 0 otherwise.
# Filter for votes related to colonialism
votes_joined %>%
filter(co == 1)
## # A tibble: 60,962 x 14
## rcid session vote ccode year country date unres me nu
## <dbl> <dbl> <dbl> <int> <dbl> <chr> <dttm> <chr> <dbl> <dbl>
## 1 51 2 3 2 1947 United~ 1947-10-02 00:00:00 R/2/~ 0 0
## 2 51 2 3 20 1947 Canada 1947-10-02 00:00:00 R/2/~ 0 0
## 3 51 2 2 40 1947 Cuba 1947-10-02 00:00:00 R/2/~ 0 0
## 4 51 2 1 41 1947 Haiti 1947-10-02 00:00:00 R/2/~ 0 0
## 5 51 2 3 42 1947 Domini~ 1947-10-02 00:00:00 R/2/~ 0 0
## 6 51 2 2 70 1947 Mexico 1947-10-02 00:00:00 R/2/~ 0 0
## 7 51 2 2 90 1947 Guatem~ 1947-10-02 00:00:00 R/2/~ 0 0
## 8 51 2 2 92 1947 El Sal~ 1947-10-02 00:00:00 R/2/~ 0 0
## 9 51 2 3 93 1947 Nicara~ 1947-10-02 00:00:00 R/2/~ 0 0
## 10 51 2 2 95 1947 Panama 1947-10-02 00:00:00 R/2/~ 0 0
## # ... with 60,952 more rows, and 4 more variables: di <dbl>, hr <dbl>,
## # co <dbl>, ec <dbl>
Visualizing colonialism votes
In an earlier exercise, you graphed the percentage of votes each year where the US voted “yes”. Now you’ll create that same graph, but only for votes related to colonialism.
# Load the ggplot2 package
library(ggplot2)
# Filter, then summarize by year: US_co_by_year
US_co_by_year <- votes_joined %>%
filter(country == "United States", co == 1) %>%
group_by(year) %>%
summarize(percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
# Graph the % of "yes" votes over time
ggplot(US_co_by_year, aes(year, percent_yes)) +
geom_line()
Using gather to tidy a dataset
In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic. This will change the data from having six topic columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).
# Load the tidyr package
library(tidyr)
# Gather the six me/nu/di/hr/co/ec columns
votes_joined %>%
gather(topic, has_topic, me:ec)
## # A tibble: 2,121,282 x 10
## rcid session vote ccode year country date unres topic
## <dbl> <dbl> <dbl> <int> <dbl> <chr> <dttm> <chr> <chr>
## 1 46 2 1 2 1947 United~ 1947-09-04 00:00:00 R/2/~ me
## 2 46 2 1 20 1947 Canada 1947-09-04 00:00:00 R/2/~ me
## 3 46 2 1 40 1947 Cuba 1947-09-04 00:00:00 R/2/~ me
## 4 46 2 1 41 1947 Haiti 1947-09-04 00:00:00 R/2/~ me
## 5 46 2 1 42 1947 Domini~ 1947-09-04 00:00:00 R/2/~ me
## 6 46 2 1 70 1947 Mexico 1947-09-04 00:00:00 R/2/~ me
## 7 46 2 1 90 1947 Guatem~ 1947-09-04 00:00:00 R/2/~ me
## 8 46 2 1 91 1947 Hondur~ 1947-09-04 00:00:00 R/2/~ me
## 9 46 2 1 92 1947 El Sal~ 1947-09-04 00:00:00 R/2/~ me
## 10 46 2 1 93 1947 Nicara~ 1947-09-04 00:00:00 R/2/~ me
## # ... with 2,121,272 more rows, and 1 more variable: has_topic <dbl>
# Perform gather again, then filter
votes_gathered <- votes_joined %>%
gather(topic, has_topic, me:ec) %>%
filter(has_topic == 1)
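As an aside, gather() has since been superseded in tidyr; an equivalent sketch with pivot_longer(), assuming tidyr 1.0 or later:
# Same reshaping with the newer tidyr verb
votes_joined %>%
  pivot_longer(me:ec, names_to = "topic", values_to = "has_topic") %>%
  filter(has_topic == 1)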
Recoding the topics
There’s one more step of data cleaning to make this more interpretable. Right now, topics are represented by two-letter codes:
me: Palestinian conflict
nu: Nuclear weapons and nuclear material
di: Arms control and disarmament
hr: Human rights
co: Colonialism
ec: Economic development
So that you can interpret the data more easily, recode the data to replace these codes with their full names. You can do that with dplyr’s recode() function, which replaces values with ones you specify:
example <- c("apple", "banana", "apple", "orange")
recode(example,
apple = "plum",
banana = "grape")
# Replace the two-letter codes in topic: votes_tidied
votes_tidied <- votes_gathered %>%
mutate(topic = recode(topic,
me = "Palestinian conflict",
nu = "Nuclear weapons and nuclear material",
di = "Arms control and disarmament",
hr = "Human rights",
co = "Colonialism",
ec = "Economic development"))
Summarize by country, year, and topic
In previous exercises, you summarized the votes dataset by country, by year, and by country-year combination.
Now that you have topic as an additional variable, you can summarize the votes for each combination of country, year, and topic (e.g. for the United States in 2013 on the topic of nuclear weapons).
# Print votes_tidied
votes_tidied
## # A tibble: 350,032 x 10
## rcid session vote ccode year country date unres topic
## <dbl> <dbl> <dbl> <int> <dbl> <chr> <dttm> <chr> <chr>
## 1 77 2 1 2 1947 United~ 1947-11-06 00:00:00 R/2/~ Pale~
## 2 77 2 1 20 1947 Canada 1947-11-06 00:00:00 R/2/~ Pale~
## 3 77 2 3 40 1947 Cuba 1947-11-06 00:00:00 R/2/~ Pale~
## 4 77 2 1 41 1947 Haiti 1947-11-06 00:00:00 R/2/~ Pale~
## 5 77 2 1 42 1947 Domini~ 1947-11-06 00:00:00 R/2/~ Pale~
## 6 77 2 2 70 1947 Mexico 1947-11-06 00:00:00 R/2/~ Pale~
## 7 77 2 1 90 1947 Guatem~ 1947-11-06 00:00:00 R/2/~ Pale~
## 8 77 2 2 91 1947 Hondur~ 1947-11-06 00:00:00 R/2/~ Pale~
## 9 77 2 2 92 1947 El Sal~ 1947-11-06 00:00:00 R/2/~ Pale~
## 10 77 2 1 93 1947 Nicara~ 1947-11-06 00:00:00 R/2/~ Pale~
## # ... with 350,022 more rows, and 1 more variable: has_topic <dbl>
# Summarize the percentage "yes" per country-year-topic
by_country_year_topic <- votes_tidied %>%
group_by(country, year, topic) %>%
summarize(total = n(), percent_yes = mean(vote == 1)) %>%
ungroup()
## `summarise()` regrouping output by 'country', 'year' (override with `.groups` argument)
# Print by_country_year_topic
by_country_year_topic
## # A tibble: 26,968 x 5
## country year topic total percent_yes
## <chr> <dbl> <chr> <int> <dbl>
## 1 Afghanistan 1947 Colonialism 8 0.5
## 2 Afghanistan 1947 Economic development 1 0
## 3 Afghanistan 1947 Human rights 1 0
## 4 Afghanistan 1947 Palestinian conflict 6 0
## 5 Afghanistan 1949 Arms control and disarmament 3 0
## 6 Afghanistan 1949 Colonialism 22 0.864
## 7 Afghanistan 1949 Economic development 1 1
## 8 Afghanistan 1949 Human rights 3 0
## 9 Afghanistan 1949 Nuclear weapons and nuclear material 3 0
## 10 Afghanistan 1949 Palestinian conflict 11 0.818
## # ... with 26,958 more rows
Visualizing trends in topics for one country
You can now visualize the trends in percentage “yes” over time for all six topics side-by-side. Here, you’ll visualize them just for the United States.
# Load the ggplot2 package
library(ggplot2)
# Filter by_country_year_topic for just the US
US_by_country_year_topic <- by_country_year_topic %>%
filter(country == "United States")
# Plot % yes over time for the US, faceting by topic
ggplot(US_by_country_year_topic, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ topic)
Nesting by topic and country
In the last chapter, you constructed a linear model for each country by nesting the data for each country, fitting a model to each dataset, then tidying each model with broom and unnesting the coefficients. The code looked something like this:
country_coefficients <- by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
Now, you’ll again be modeling the change in the percentage of “yes” votes over time, but instead of fitting one model for each country, you’ll fit one for each combination of country and topic.
# Load purrr, tidyr, and broom
library(purrr)
library(tidyr)
library(broom)
# Print by_country_year_topic
by_country_year_topic
## # A tibble: 26,968 x 5
## country year topic total percent_yes
## <chr> <dbl> <chr> <int> <dbl>
## 1 Afghanistan 1947 Colonialism 8 0.5
## 2 Afghanistan 1947 Economic development 1 0
## 3 Afghanistan 1947 Human rights 1 0
## 4 Afghanistan 1947 Palestinian conflict 6 0
## 5 Afghanistan 1949 Arms control and disarmament 3 0
## 6 Afghanistan 1949 Colonialism 22 0.864
## 7 Afghanistan 1949 Economic development 1 1
## 8 Afghanistan 1949 Human rights 3 0
## 9 Afghanistan 1949 Nuclear weapons and nuclear material 3 0
## 10 Afghanistan 1949 Palestinian conflict 11 0.818
## # ... with 26,958 more rows
# Fit model on the by_country_year_topic dataset
country_topic_coefficients <- by_country_year_topic %>%
nest(-country, -topic) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## (warning repeated for several other country-topic models; output truncated)
# Print country_topic_coefficients
country_topic_coefficients
## # A tibble: 2,384 x 9
## country topic data model term estimate std.error statistic p.value
## <chr> <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani~ Coloniali~ <tibbl~ <lm> (Inte~ -9.20e+0 1.96 -4.70 4.76e-5
## 2 Afghani~ Coloniali~ <tibbl~ <lm> year 5.11e-3 0.000989 5.17 1.23e-5
## 3 Afghani~ Economic ~ <tibbl~ <lm> (Inte~ -1.15e+1 3.62 -3.17 3.49e-3
## 4 Afghani~ Economic ~ <tibbl~ <lm> year 6.24e-3 0.00183 3.42 1.85e-3
## 5 Afghani~ Human rig~ <tibbl~ <lm> (Inte~ -7.27e+0 4.37 -1.66 1.06e-1
## 6 Afghani~ Human rig~ <tibbl~ <lm> year 4.08e-3 0.00221 1.85 7.43e-2
## 7 Afghani~ Palestini~ <tibbl~ <lm> (Inte~ -1.33e+1 3.57 -3.73 8.66e-4
## 8 Afghani~ Palestini~ <tibbl~ <lm> year 7.17e-3 0.00180 3.98 4.42e-4
## 9 Afghani~ Arms cont~ <tibbl~ <lm> (Inte~ -1.38e+1 4.13 -3.33 2.53e-3
## 10 Afghani~ Arms cont~ <tibbl~ <lm> year 7.37e-3 0.00208 3.54 1.49e-3
## # ... with 2,374 more rows
Great work! You can ignore the warning messages in the console for now.
Interpreting tidy models
Now you have both the slope and intercept terms for each model. Just as you did in the last chapter with the tidied coefficients, you’ll need to filter for only the slope terms.
You’ll also have to extract only cases that are statistically significant, which means adjusting the p-value for the number of models, and then filtering to include only significant changes.
# Create country_topic_filtered
country_topic_filtered <- country_topic_coefficients %>%
filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
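To see which country-topic combinations changed most steeply, one option is to sort the filtered slopes by their estimates, mirroring what you did for the by-country models. A sketch:
# Largest positive slopes in % "yes" by country and topic
country_topic_filtered %>%
  arrange(desc(estimate))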
Checking models visually
Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.
Let’s examine this country’s voting patterns more closely. Recall that the by_country_year_topic dataset contained one row for each combination of country, year, and topic. You can use that to create a plot of Vanuatu’s voting, faceted by topic.
# Create vanuatu_by_country_year_topic
vanuatu_by_country_year_topic <- by_country_year_topic %>%
filter(country == "Vanuatu")
# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ topic)
Phenomenal work! Congratulations on finishing the course!