Race and Ethnic Representation in the Mortgage Data In this exercise, you'll get the race and ethnic proportions of borrowers in the mortgage data set, adjusted by the total number of borrowers. This will turn the race and ethnicity table you created before into a proportion. Later on, you'll use these values to adjust for the race and ethnic proportions of the US population. The reference object mort is available in your workspace. * Create a table from the borrower_race column in the mortgage data (mort) using bigtable(). * Where the race is known (first 7 values) find the proportion by dividing by the sum of the counts. # Create a table of borrower_race column race_table <- bigtable(mort, "borrower_race") # Rename the elements names(race_table) <- race_cat[as.numeric(names(race_table))] # Find the proportion race_table[1:7] / sum(race_table[1:7]) ---------------------------------------------------------------------------------------------------- Comparing the Borrower Race/Ethnicity and their Proportions In this exercise, you'll compare the US race and ethnic proportions to proportion of total borrowers by race or ethnicity. This will provide an initial check to see if each group is borrowing at a rate comparable to its proportional representation in the United States. The task is similar to the last exercise, but this time you'll use iotools to accomplish it. * Create a matrix from each chunk of "mortgage-sample.csv". * Add up the rows for all columns of race_table_chunks. # Create table of the borrower_race race_table_chunks <- chunk.apply( "mortgage-sample.csv", function(chunk) { x <- mstrsplit(chunk, sep = ",", type = "integer") colnames(x) <- mort_names table(x[, "borrower_race"]) }, CH.MAX.SIZE = 1e5) # Add up the columns race_table <- colSums(race_table_chunks) # Find the proportion borrower_proportion <- race_table[1:7] / sum(race_table[1:7]) # Create the matrix matrix(c(pop_proportion, borrower_proportion), byrow = TRUE, nrow = 2, dimnames = list(c("Population Proportion", "Borrower Proportion"), race_cat[1:7])) ---------------------------------------------------------------------------------------------------- Looking for Predictable Missingness If data are missing completely at random, then you shouldn't be able to predict when a variable is missing based on the rest of the data. Therefore, if you can predict missingness then the data are not missing completely at random. So, let's use the glm() function to fit a logistic regression, looking for missingness based on affordability in the mort variable you created earlier. If you don't find any structure in the missing data - i.e., the slope variables are not significant - it does not mean that you have proven the data are missing at random, but it is plausible. * Create a variable indicating if the "borrower_race" is missing (equal to 9) in the mortgage data. * Create a factor variable of the "affordability" column. * Regress affordability_factor on borrower_race_ind and call summary() on it. # Create a variable indicating if borrower_race is missing in the mortgage data borrower_race_ind <- mort[, "borrower_race"] == 9 # Create a factor variable indicating the affordability affordability_factor <- factor(mort[, "affordability"]) # Perform a logistic regression summary(glm(borrower_race_ind ~ affordability_factor, family = binomial)) It doesn't look like there is a relationsip between missingness in the borrower's race and the affordability of the home. ---------------------------------------------------------------------------------------------------- Borrower Race and Ethnicity by Year (II) In this exercise, you'll use both iotools and bigtabulate to tabulate borrower race and ethnicity by year. iotools and bigtabulate are loaded in your workspace. * Create a function make_table() that reads in chunk as a matrix and then tabulates it by borrower race and year. * Use chunk.apply() to import the data from the file connection we created for you. * Convert race_year_table to a data frame. # Open a connection to the file and skip the header fc <- file("mortgage-sample.csv", "rb") readLines(fc, n = 1) # Create a function to read chunks make_table <- function(chunk) { # Create a matrix m <- mstrsplit(chunk, sep = ",", type = "integer") colnames(m) <- mort_names # Create the output table bigtable(m, c("borrower_race", "year")) } # Import data using chunk.apply race_year_table <- chunk.apply(fc, make_table) # Close connection close(fc) # Cast it to a data frame rydf <- as.data.frame(race_year_table) # Create a new column Race with race/ethnicity rydf$Race <- race_cat > rydf 2008 2009 2010 2011 2012 2013 2014 2015 Race 1 11 18 13 16 15 12 29 29 Native Am 2 384 583 603 568 770 673 369 488 Asian 3 363 320 209 204 258 312 185 169 Black 4 33 38 21 13 28 22 17 23 Pacific Is 5 5552 7739 6301 5746 8192 7535 4110 4831 White 6 43 85 65 58 89 78 46 64 Two or More 7 577 563 384 378 574 613 439 512 Hispanic 9 1505 1755 1240 1013 1009 971 519 618 Not Avail ---------------------------------------------------------------------------------------------------- Visualizing the Adjusted Demographic Trends Let's compare changes in borrowing across demographics over time. The data frame rydf you created in the last exercise is available in your workspace. Note: We removed the row corresponding to "Not Avail". tidyr and ggplot2 are loaded in your workspace. * Print the rydf and pop_proportion objects. * Convert rydf to a long-formatted data frame by gathering all columns except Race. * Create a line chart with Year and Adjusted_Count on the x and y axes, respectively. # View rydf rydf # View pop_proportion pop_proportion # Gather on all variables except Race rydfl <- gather(rydf, Year, Count, -Race) # Create a new adjusted count variable rydfl$Adjusted_Count <- rydfl$Count / pop_proportion[rydfl$Race] # Plot ggplot(rydfl, aes(x = Year, y = Adjusted_Count, group = Race, color = Race)) + geom_line() rydf 2008 2009 2010 2011 2012 2013 2014 2015 Race 1 11 18 13 16 15 12 29 29 Native Am 2 384 583 603 568 770 673 369 488 Asian 3 363 320 209 204 258 312 185 169 Black 4 33 38 21 13 28 22 17 23 Pacific Is 5 5552 7739 6301 5746 8192 7535 4110 4831 White 6 43 85 65 58 89 78 46 64 Two or More 7 577 563 384 378 574 613 439 512 Hispanic pop_proportion Native Am Asian Black Pacific Is White Two or More 0.009 0.048 0.126 0.002 0.724 0.029 Hispanic 0.163 ---------------------------------------------------------------------------------------------------- Relative change in demographic trend In the last exercise, you looked at the changes in borrowing across demographics over time. In this exercise, you'll look at the relative change in demographic trend. To do this, you'll normalize each group's trend by borrowing in the first year, 2008. * Get the first column of rydf as column1. * Loop over columns 1:8, normalizing them by dividing each column by the first column. * Plot Year on the x-axis and Proportion the y-axis, colored and grouped by the Race variable. # View rydf rydf # Get the first column of rydf column1 <- rydf[, 1] # Normalize the first 8 columns for(this_column in 1:8) { rydf[, this_column] <- rydf[, this_column] / column1 } # Convert the data to long format rydf_long <- gather(rydf, Year, Proportion, -Race) # Plot ggplot(rydf_long, aes(x = Year, y = Proportion, group = Race, color = Race)) + geom_line() ---------------------------------------------------------------------------------------------------- Borrower Region by Year In this exercise you'll tabulate the data by year and the msa (city vs rural) variable. All the required packages are loaded in your workspace. * Create a function make_table() that reads in chunk as a matrix and then tabulates it by borrower region (msa) and year. * Use chunk.apply() to import the data from the file connection we created for you. * Run the rest of the code to plot the changes in mortgages received by region. # Open a connection to the file and skip the header fc <- file("mortgage-sample.csv", "rb") readLines(fc, n = 1) # Create a function to read chunks make_table <- function(chunk) { # Create a matrix m <- mstrsplit(chunk, sep = ",", type = "integer") colnames(m) <- mort_names # Create the output table bigtable(m, c("msa", "year")) } # Import data using chunk.apply msa_year_table <- chunk.apply(fc, make_table) # Close connection close(fc) # Convert to a data frame df_msa <- as.data.frame(msa_year_table) # Rename columns df_msa$MSA <- c("rural", "city") # Gather on all columns except Year df_msa_long <- gather(df_msa, Year, Count, -MSA) # Plot ggplot(df_msa_long, aes(x = Year, y = Count, group = MSA, color = MSA)) + geom_line() ---------------------------------------------------------------------------------------------------- Who is securing federally guaranteed loans? Borrower's income is not in the data set. However, annual income divided by the median income of people in the local area is. This is called the Borrower Income Ratio. Let's look at the proportion of federally guaranteed loans for each borrower income category. * Use the bigtable() function to make a table of the borrower_income_ratio by federal_guarantee. * For each row in ir_by_fg, divide by the sum of the row. # Tabulate borrower_income_ratio and federal_guarantee ir_by_fg <- bigtable(mort, c("borrower_income_ratio", "federal_guarantee")) # Label the columns and rows of the table dimnames(ir_by_fg) <- list(income_cat, guarantee_cat) # For each row in ir_by_fg, divide by the sum of the row for (i in seq_len(nrow(ir_by_fg))) { ir_by_fg[i, ] = ir_by_fg[i, ] / sum(ir_by_fg[i, ]) } # Print ir_by_fg ir_by_fg FHA/VA RHS HECM No Guarantee 0 <= 50 0.008944544 0.0014636526 0.0443974630 0.9451943 50 < 80 0.005977548 0.0024055985 0.0026971862 0.9889197 > 80 0.001113022 0.0002428412 0.0006475766 0.9979966 Not Applicable 0.023676880 0.0013927577 0.0487465181 0.9261838