Race and Ethnic Representation in the Mortgage Data

In this exercise, you'll get the race and ethnic proportions of borrowers in the mortgage data set, adjusted by the total
number of borrowers. This will turn the race and ethnicity table you created before into a proportion. Later on, you'll use
these values to adjust for the race and ethnic proportions of the US population.

The reference object mort is available in your workspace.

* Create a table from the borrower_race column in the mortgage data (mort) using bigtable().
* Where the race is known (first 7 values) find the proportion by dividing by the sum of the counts.

# Create a table of borrower_race column
race_table <- bigtable(mort, "borrower_race")

# Rename the elements
names(race_table) <- race_cat[as.numeric(names(race_table))]

# Find the proportion
race_table[1:7] / sum(race_table[1:7])

----------------------------------------------------------------------------------------------------

Comparing the Borrower Race/Ethnicity and their Proportions

In this exercise, you'll compare the US race and ethnic proportions to proportion of total borrowers by race or ethnicity.
This will provide an initial check to see if each group is borrowing at a rate comparable to its proportional representation
in the United States. The task is similar to the last exercise, but this time you'll use iotools to accomplish it.

* Create a matrix from each chunk of "mortgage-sample.csv".
* Add up the rows for all columns of race_table_chunks.

# Create table of the borrower_race 
race_table_chunks <- chunk.apply(
    "mortgage-sample.csv", function(chunk) { 
        x <- mstrsplit(chunk, sep = ",", type = "integer") 
        colnames(x) <- mort_names 
        table(x[, "borrower_race"])
}, CH.MAX.SIZE = 1e5)

# Add up the columns
race_table <- colSums(race_table_chunks)

# Find the proportion
borrower_proportion <- race_table[1:7] / sum(race_table[1:7])

# Create the matrix
matrix(c(pop_proportion, borrower_proportion), byrow = TRUE, nrow = 2,
  dimnames = list(c("Population Proportion", "Borrower Proportion"), race_cat[1:7]))

----------------------------------------------------------------------------------------------------

Looking for Predictable Missingness

If data are missing completely at random, then you shouldn't be able to predict when a variable is missing based on the rest
of the data. Therefore, if you can predict missingness then the data are not missing completely at random. So, let's use the
glm() function to fit a logistic regression, looking for missingness based on affordability in the mort variable you created
earlier. If you don't find any structure in the missing data - i.e., the slope variables are not significant - it does not
mean that you have proven the data are missing at random, but it is plausible.

* Create a variable indicating if the "borrower_race" is missing (equal to 9) in the mortgage data.
* Create a factor variable of the "affordability" column.
* Regress affordability_factor on borrower_race_ind and call summary() on it.

# Create a variable indicating if borrower_race is missing in the mortgage data
borrower_race_ind <- mort[, "borrower_race"] == 9

# Create a factor variable indicating the affordability
affordability_factor <- factor(mort[, "affordability"])

# Perform a logistic regression
summary(glm(borrower_race_ind ~ affordability_factor, family = binomial))

It doesn't look like there is a relationsip between missingness in the borrower's race and the affordability of the home. 

----------------------------------------------------------------------------------------------------

Borrower Race and Ethnicity by Year (II)

In this exercise, you'll use both iotools and bigtabulate to tabulate borrower race and ethnicity by year.

iotools and bigtabulate are loaded in your workspace.

* Create a function make_table() that reads in chunk as a matrix and then tabulates it by borrower race and year.
* Use chunk.apply() to import the data from the file connection we created for you.
* Convert race_year_table to a data frame.

# Open a connection to the file and skip the header
fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)

# Create a function to read chunks
make_table <- function(chunk) {
    # Create a matrix
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    colnames(m) <- mort_names
    # Create the output table
    bigtable(m, c("borrower_race", "year"))
}

# Import data using chunk.apply
race_year_table <- chunk.apply(fc, make_table)

# Close connection
close(fc)

# Cast it to a data frame
rydf <- as.data.frame(race_year_table)

# Create a new column Race with race/ethnicity
rydf$Race <- race_cat

>
rydf
  2008 2009 2010 2011 2012 2013 2014 2015        Race
1   11   18   13   16   15   12   29   29   Native Am
2  384  583  603  568  770  673  369  488       Asian
3  363  320  209  204  258  312  185  169       Black
4   33   38   21   13   28   22   17   23  Pacific Is
5 5552 7739 6301 5746 8192 7535 4110 4831       White
6   43   85   65   58   89   78   46   64 Two or More
7  577  563  384  378  574  613  439  512    Hispanic
9 1505 1755 1240 1013 1009  971  519  618   Not Avail

----------------------------------------------------------------------------------------------------

Visualizing the Adjusted Demographic Trends

Let's compare changes in borrowing across demographics over time. The data frame rydf you created in the last exercise is
available in your workspace.

Note: We removed the row corresponding to "Not Avail".

tidyr and ggplot2 are loaded in your workspace.

* Print the rydf and pop_proportion objects.
* Convert rydf to a long-formatted data frame by gathering all columns except Race.
* Create a line chart with Year and Adjusted_Count on the x and y axes, respectively.

# View rydf
rydf

# View pop_proportion
pop_proportion 

# Gather on all variables except Race
rydfl <- gather(rydf, Year, Count, -Race)

# Create a new adjusted count variable
rydfl$Adjusted_Count <- rydfl$Count / pop_proportion[rydfl$Race]

# Plot
ggplot(rydfl, aes(x = Year, y = Adjusted_Count, group = Race, color = Race)) + 
    geom_line()
    
    
rydf
  2008 2009 2010 2011 2012 2013 2014 2015        Race
1   11   18   13   16   15   12   29   29   Native Am
2  384  583  603  568  770  673  369  488       Asian
3  363  320  209  204  258  312  185  169       Black
4   33   38   21   13   28   22   17   23  Pacific Is
5 5552 7739 6301 5746 8192 7535 4110 4831       White
6   43   85   65   58   89   78   46   64 Two or More
7  577  563  384  378  574  613  439  512    Hispanic

pop_proportion 
  Native Am       Asian       Black  Pacific Is       White Two or More 
      0.009       0.048       0.126       0.002       0.724       0.029 
   Hispanic 
      0.163 


----------------------------------------------------------------------------------------------------

Relative change in demographic trend

In the last exercise, you looked at the changes in borrowing across demographics over time. In this exercise, you'll look
at the relative change in demographic trend. To do this, you'll normalize each group's trend by borrowing in the first
year, 2008.

* Get the first column of rydf as column1.
* Loop over columns 1:8, normalizing them by dividing each column by the first column.
* Plot Year on the x-axis and Proportion the y-axis, colored and grouped by the Race variable.

# View rydf
rydf

# Get the first column of rydf
column1 <- rydf[, 1]

# Normalize the first 8 columns
for(this_column in 1:8) {
  rydf[, this_column] <- rydf[, this_column] / column1
}

# Convert the data to long format
rydf_long <- gather(rydf, Year, Proportion, -Race)

# Plot
ggplot(rydf_long, aes(x = Year, y = Proportion, group = Race, color = Race)) + 
    geom_line()

----------------------------------------------------------------------------------------------------

Borrower Region by Year

In this exercise you'll tabulate the data by year and the msa (city vs rural) variable.

All the required packages are loaded in your workspace.

* Create a function make_table() that reads in chunk as a matrix and then tabulates it by borrower region (msa) and year.
* Use chunk.apply() to import the data from the file connection we created for you.
* Run the rest of the code to plot the changes in mortgages received by region.

# Open a connection to the file and skip the header
fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)

# Create a function to read chunks
make_table <- function(chunk) {
    # Create a matrix
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    colnames(m) <- mort_names
    # Create the output table
    bigtable(m, c("msa", "year"))
}

# Import data using chunk.apply
msa_year_table <- chunk.apply(fc, make_table)

# Close connection
close(fc)

# Convert to a data frame
df_msa <- as.data.frame(msa_year_table)

# Rename columns
df_msa$MSA <- c("rural", "city")

# Gather on all columns except Year
df_msa_long <- gather(df_msa, Year, Count, -MSA)

# Plot 
ggplot(df_msa_long, aes(x = Year, y = Count, group = MSA, color = MSA)) + 
    geom_line()

----------------------------------------------------------------------------------------------------

Who is securing federally guaranteed loans?

Borrower's income is not in the data set. However, annual income divided by the median income of people in
the local area is. This is called the Borrower Income Ratio. Let's look at the proportion of federally
guaranteed loans for each borrower income category.

* Use the bigtable() function to make a table of the borrower_income_ratio by federal_guarantee.
* For each row in ir_by_fg, divide by the sum of the row.

# Tabulate borrower_income_ratio and federal_guarantee
ir_by_fg <- bigtable(mort, c("borrower_income_ratio", "federal_guarantee"))

# Label the columns and rows of the table
dimnames(ir_by_fg) <- list(income_cat, guarantee_cat)

# For each row in ir_by_fg, divide by the sum of the row
for (i in seq_len(nrow(ir_by_fg))) {
  ir_by_fg[i, ] = ir_by_fg[i, ] / sum(ir_by_fg[i, ])
}

# Print
ir_by_fg


ir_by_fg
                    FHA/VA          RHS         HECM No Guarantee
0 <= 50        0.008944544 0.0014636526 0.0443974630    0.9451943
50 < 80        0.005977548 0.0024055985 0.0026971862    0.9889197
> 80           0.001113022 0.0002428412 0.0006475766    0.9979966
Not Applicable 0.023676880 0.0013927577 0.0487465181    0.9261838