Tabulating using bigtable

The bigtabulate package provides optimized routines for creating tables and splitting the rows of big.matrix objects. Let's
say you wanted to see the breakdown by ethnicity of mortgages in the housing data. The documentation from the website
provides the mapping from the numerical value to ethnicity. In this exercise, you'll create a table using the bigtable()
function, found in the bigtabulate package.

The character vector race_cat is available in your workspace.

* Load the bigtabulate package.
* Call bigtable() create a variable called race_table.
* Rename the elements of race_table to race_cat using the names() function.

# Load the bigtabulate package
library(bigtabulate)

# Call bigtable to create a variable called race_table
race_table <- bigtable(mort, "borrower_race")

# Rename the elements of race_table
names(race_table) <- race_cat
race_table

----------------------------------------------------------------------------------------------------

Borrower Race and Ethnicity by Year (I)

As a second exercise in creating big tables, suppose you want to see the total count by year, rather than for all years at
once. Then you would create a table for each ethnicity for each year.

The character vector race_cat is available in your workspace.

* Use the bigtable() function to create a table of the borrower race ("borrower_race") by year ("year").
* Use the as.data.frame() function to convert the table into a data.frame and assign it to rfdf.
* Create a new column (Race) holding the race/ethnicity information using the race_cat variable.

# Create a table of the borrower race by year
race_year_table <- bigtable(mort, c("borrower_race", "year"))

# Convert rydf to a data frame
rydf <- as.data.frame(race_year_table)

# Create the new column Race
rydf$Race <- race_cat

# Let's see what it looks like
rydf

----------------------------------------------------------------------------------------------------

Female Proportion Borrowing

In the last exercise, you stratified by year and race (or ethnicity). However, there are lots of other ways you can partition
the data. In this exercise and the next, you'll find the proportion of female borrowers in urban and rural areas by year. 
This exercise is slightly different from the last one because rather than simply finding counts of things you want to get 
the proportion of female borrowers conditioned on the year.

In this exercise, we have defined a function that finds the proportion of female borrowers for urban and rural areas:
female_residence_prop().

* Call female_residence_prop() to find the proportion of female borrowers for urban and rural areas for 2015:
   * The first argument is the data, mort.
   * The second argument is a logical vector corresponding to the row numbers of 2015.
   
female_residence_prop <- function(x, rows) {
    x_subset <- x[rows, ]
    # Find the proporation of female borrowers in urban areas
    prop_female_urban <- sum(x_subset[, "borrower_gender"] == 2 & 
                                 x_subset[, "msa"] == 1) / 
        sum(x_subset[, "msa"] == 1)
    # Find the proporation of female borrowers in rural areas
    prop_female_rural <- sum(x_subset[, "borrower_gender"] == 2 & 
                                 x_subset[, "msa"] == 0) / 
        sum(x_subset[, "msa"] == 0)
    
    c(prop_female_urban, prop_female_rural)
}

# Find the proportion of female borrowers in 2015
female_residence_prop(mort, mort[,"year"] == 2015)

----------------------------------------------------------------------------------------------------

Split

To calculate the proportions for all years, you will use the function female_residence_prop() defined 
in the previous exercise along with three other functions:

* split(): To "split" the mort data by year
* Map(): To "apply" the function female_residence_prop() to each of the subsets returned from split()
* Reduce(): To combine the results obtained from Map()

In this exercise, you will "split" the mort data by year.

* Split the row numbers of the mortgage data by the "year" column and assign the result to spl.
* Call str() on spl to see the results.

# Split the row numbers of the mortage data by year
spl <- split(1:nrow(mort), mort[, "year"])

# Call str on spl
str(spl)

----------------------------------------------------------------------------------------------------

Apply

In this exercise, you will "apply" the function female_residence_prop() to obtain the proportion of female borrowers
for both urban and rural areas for all years using the Map() function.

spl from the previous exercise is available in your workspace.

* Call Map() on each of the row splits of spl.
* Recall that the function to apply is female_residence_prop(). Assign the result to all_years.
* View the str()ucture of all_years.

# For each of the row splits, find the female residence proportion
all_years <- Map(function(rows) female_residence_prop(mort, rows), spl)

# Call str on all_years
str(all_years)

----------------------------------------------------------------------------------------------------

Combine

You now know the female proportion borrowing for urban and rural areas for all years. However, the result resides
in a list. Converting this list to a matrix or data frame may sometimes be convenient in case you want to calculate any
summary statistics or visualize the results. In this exercise, you will combine the results into a matrix.

all_years from the previous exercise is available in your workspace.

* Call Reduce() on all_years to combine the results from the previous exercise. The function to apply here 
  is rbind (short for row bind).
* Use the dimnames() function to add row and column names to this matrix, prop_female.
     * The rownames are the names() of the list all_years.
     
# Collect the results as rows in a matrix
prop_female <- Reduce(rbind, all_years)

# Rename the row and column names
dimnames(prop_female) <- list(names(all_years), c("prop_female_urban", "prop_femal_rural"))

# View the matrix
prop_female

----------------------------------------------------------------------------------------------------

Visualizing Female Proportion Borrowing

The return type of functions in the bigtabulate and biganalytics packages are base R types that can be used just like you
would with any analysis. This means that we can visualize results using ggplot2.

In this exercise, you will visualize the female proportion borrowing for urban and rural areas across all years.

The matrix prop_female from the previous exercise is available in your workspace.

* Load the tidyr and ggplot2 packages.
* Convert prop_female to a data frame using as.data.frame().
* Add a new column, Year. Set it to the row.names() of prop_female_df.
* Call gather() on the columns of prop_female_df to convert it into a long format.

# Load the tidyr and ggplot2 packages
library(tidyr)
library(ggplot2)

# Convert prop_female to a data frame
prop_female_df <- as.data.frame(prop_female)

# Add a new column Year
prop_female_df$Year <- row.names(prop_female_df)

# Call gather on prop_female_df
prop_female_long <- gather(prop_female_df, Region, Prop, -Year)

# Create a line plot
ggplot(prop_female_long, aes(x = Year, y = Prop, group = Region, color = Region)) + 
    geom_line()

----------------------------------------------------------------------------------------------------

The Borrower Income Ratio

The borrower income ratio is the ratio of the borrower’s (or borrowers’) annual income to the median family income of the
area for the reporting year. This is the ratio used to determine whether borrower’s income qualifies for an income-based housing goal.

In the data set mort, missing values are recoded as 9. In this exercise, we replaced the 9's in the "borrower_income_ratio"
column with NA, so you can create a table of the borrower income ratios.

* Load the biganalytics and dplyr packages.
* Call summary() on mort to check that "borrower_income_ratio" now has NAs.
* Using bigtable(), create a table of borrower income ratios for each year.
* Use dplyr's mutate() to add a new column BIR.

# Load biganalytics and dplyr packages
library(biganalytics)
library(dplyr)

# Call summary on mort
summary(mort)

bir_df_wide <- bigtable(mort, c("borrower_income_ratio", "year")) %>% 
    # Turn it into a data.frame
    as.data.frame() %>% 
    # Create a new column called BIR with the corresponding table categories
    mutate(BIR = c(">=0,<=50%", ">50, <=80%", ">80%"))

bir_df_wide

----------------------------------------------------------------------------------------------------

Tidy Big Tables

As a final exercise of using the "tidyverse" packages in combination with the "bigmemory" suite of packages,
you will again use the tidyr and ggplot2 packages to plot the Borrower Income ratio over time.

* Load the tidyr and ggplot2 packages.
* Use the gather() function to gather the counts by year.
* Create a line plot with Year on the x-axis and Count on the y-axis. Color and group by BIR.

# Load the tidyr and ggplot2 packages
library(tidyr)
library(ggplot2)

bir_df_wide %>% 
    # Transform the wide-formatted data.frame into the long format
    gather(Year, Count, -BIR) %>%
    # Use ggplot to create a line plot
    ggplot(aes(x = Year, y = Count, group = BIR, color = BIR)) + 
    geom_line()