Foldable operations (I)

An operation that gives the same answer whether you apply it to an entire data set, or to chunks of the data set and then to the results from those chunks, is sometimes called foldable. The max() and min() operations are examples of this. Here, we have defined a foldable version of the range() function that takes either a vector or a list of vectors. Verify that the function works by testing it on the mortgage data set.

* Verify that foldable_range() works on the "record_number" column of the mort data set.

foldable_range <- function(x) {
  if (is.list(x)) {
    # If x is a list, reduce it by taking the min and max across its elements
    c(Reduce(min, x), Reduce(max, x))
  } else {
    # Otherwise, assume it's a vector and find its range
    range(x)
  }
}

# Verify that foldable_range() works on the record_number column
foldable_range(mort[, "record_number"])

----------------------------------------------------------------------------------------------------

Foldable operations (II)

Now you'll use the function on partitions of the data set. You should realize that by performing this operation in pieces and then aggregating, you don't need to have all of the data in a variable at once. This point isn't that important with small data sets, like the mortgage sample data, but it is for large data sets. The foldable_range() function is available in your workspace.

* Split the rows of mort by the "year" column.
* Use foldable_range() to get the range of the "record_number" column chunked by year.

# Split the row indices of the mortgage data by year
spl <- split(1:nrow(mort), mort[, "year"])

# Use foldable_range() to get the range of the record numbers, chunked by year
foldable_range(Map(function(s) foldable_range(mort[s, "record_number"]), spl))
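Not every summary statistic is foldable in this direct way. The mean of chunk means, for example, is not the overall mean when the chunks have different sizes, but the operation becomes foldable if each chunk reports its sum and count instead. The sketch below illustrates the idea on made-up toy data; foldable_mean() and the vector x are not part of the exercises.

# A sketch of a foldable mean: each chunk reports its sum and its length,
# and the partial results are combined at the end.
foldable_mean <- function(x) {
  if (is.list(x)) {
    # x is a list of c(sum, count) pairs produced from the chunks
    totals <- Reduce(`+`, x)
    totals[1] / totals[2]
  } else {
    # x is a vector: return this chunk's partial result
    c(sum(x), length(x))
  }
}

# Toy data split into unevenly sized chunks
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
chunks <- list(x[1:3], x[4:8])

# The chunked computation agrees with the mean of the whole vector
mean(x)
foldable_mean(Map(foldable_mean, chunks))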
----------------------------------------------------------------------------------------------------

Compare read.delim() and read.delim.raw()

When processing a sequence of contiguous chunks of data on a hard drive, iotools can turn a raw object into a data.frame or matrix while, at the same time, retrieving the next chunk of data. These optimizations allow iotools to process very large files quickly.

* Time the reading of a file using read.delim() five times.
* Time the reading of a file using read.delim.raw() five times.

# Load the iotools and microbenchmark packages
library(iotools)
library(microbenchmark)

# Time the reading of the file with both functions
microbenchmark(
  # Read the file using read.delim()
  read.delim("mortgage-sample.csv", header = FALSE, sep = ","),
  # Read the file using read.delim.raw()
  read.delim.raw("mortgage-sample.csv", header = FALSE, sep = ","),
  times = 5
)

----------------------------------------------------------------------------------------------------

Reading raw data and turning it into a data structure

As mentioned before, part of what makes iotools fast is that it separates reading data from the hard drive from converting the binary data into a data.frame or matrix. Data in their binary format are copied from the hard drive into memory as raw objects. These raw objects are then passed to optimized functions that turn them into data.frame or matrix objects. In this exercise, you'll learn how to separate reading data from the disk (using the readAsRaw() function) from converting raw binary data into a matrix or data.frame (using the mstrsplit() and dstrsplit() functions).

* Read "mortgage-sample.csv" as a raw vector.
* Convert the raw vector contents to a matrix of integers.
* Convert the raw vector contents to a data.frame with 16 integer columns.

# Read mortgage-sample.csv as a raw vector
raw_file_content <- readAsRaw("mortgage-sample.csv")

# Convert the raw vector contents to a matrix, skipping the header line
mort_mat <- mstrsplit(raw_file_content, sep = ",", type = "integer", skip = 1)

# Look at the first 6 rows
head(mort_mat)

# Convert the raw file contents to a data.frame with 16 integer columns
mort_df <- dstrsplit(raw_file_content, sep = ",", col_types = rep("integer", 16), skip = 1)

# Look at the first 6 rows
head(mort_df)

----------------------------------------------------------------------------------------------------

Reading chunks in as a matrix

In this exercise, you'll write a scalable table() function counting the number of urban and rural borrowers in the mortgage data set using chunk.apply(). By default, chunk.apply() aggregates the processed data using the rbind() function. This means that you can create a table from each of the chunks and then add up the rows of the resulting matrix to get the total counts for the table. We have created a file connection fc to the "mortgage-sample.csv" file and read in the first line to get rid of the header.

* In the function make_table(), read each chunk as a matrix.
* Call chunk.apply() to read in the data as chunks.
* Print counts.
* Get the total counts for each column by adding up all the rows.

make_table <- function(chunk) {
  # Read each chunk as a matrix
  x <- mstrsplit(chunk, type = "integer", sep = ",")
  # Create a table of the urban and rural borrowers (column 3) for this chunk
  table(x[, 3])
}

# Create a file connection to mortgage-sample.csv
fc <- file("mortgage-sample.csv", "rb")

# Read the first line to get rid of the header
readLines(fc, n = 1)

# Read the data in chunks
counts <- chunk.apply(fc, make_table, CH.MAX.SIZE = 1e5)

# Close the file connection
close(fc)

# Print counts
counts

# Sum up the chunks
colSums(counts)
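The same chunked pattern works for the foldable operations from earlier in the section. Below is a minimal sketch, not one of the exercises: each chunk returns its own range, chunk.apply() stacks those partial results with rbind(), and the overall range is folded from them. The use of the first column here is only an illustrative assumption about the file's layout.

# Per-chunk partial result: the range of the first column
chunk_range <- function(chunk) {
  x <- mstrsplit(chunk, type = "integer", sep = ",")
  range(x[, 1])
}

# Read the file in chunks, skipping the header line as before
fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)
partials <- chunk.apply(fc, chunk_range, CH.MAX.SIZE = 1e5)
close(fc)

# One row per chunk: column 1 holds the chunk minima, column 2 the chunk maxima
c(min(partials[, 1]), max(partials[, 2]))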
----------------------------------------------------------------------------------------------------

Reading chunks in as a data.frame

In the previous example, we read each chunk into the processing function as a matrix using mstrsplit(). This is fine when we are reading rectangular data where every column has the same type. When that's not the case, we might like to read the data in as a data.frame. This can be done either by reading a chunk in as a matrix and then converting it to a data.frame, or by using the dstrsplit() function directly. The code below relies on the col_names and msa_map objects from your workspace, which hold the column names and a text label for each MSA code.

* In the function make_msa_table(), read each chunk as a data frame.
* Call chunk.apply() to read in the data as chunks.
* Get the total counts for each column by adding up all the rows.

# Define the function to apply to each chunk
make_msa_table <- function(chunk) {
  # Read each chunk as a data frame
  x <- dstrsplit(chunk, col_types = rep("integer", length(col_names)), sep = ",")
  # Set the column names of the data frame that's been read
  colnames(x) <- col_names
  # Create a new column, msa_pretty, with a string description of where the borrower lives
  x$msa_pretty <- msa_map[x$msa + 1]
  # Create a table from the msa_pretty column
  table(x$msa_pretty)
}

# Create a file connection to mortgage-sample.csv
fc <- file("mortgage-sample.csv", "rb")

# Read the first line to get rid of the header
readLines(fc, n = 1)

# Read the data in chunks
counts <- chunk.apply(fc, make_msa_table, CH.MAX.SIZE = 1e5)

# Close the file connection
close(fc)

# Aggregate the counts as before
colSums(counts)

----------------------------------------------------------------------------------------------------

Parallelizing calls to chunk.apply

The chunk.apply() function can also make use of parallel processes to work through data more quickly. When the parallel parameter is set to a value greater than one on Linux and Unix machines (including macOS), multiple processes read and process data at the same time, thereby reducing the execution time. On Windows the parallel parameter is ignored.

* Benchmark the function iotools_read_fun(), first with 1 process and then with 3 parallel processes.

iotools_read_fun <- function(parallel) {
  fc <- file("mortgage-sample.csv", "rb")
  readLines(fc, n = 1)
  chunk.apply(fc, make_msa_table, CH.MAX.SIZE = 1e5, parallel = parallel)
  close(fc)
}

# Benchmark the new function
microbenchmark(
  # Use one process
  iotools_read_fun(parallel = 1),
  # Use three processes
  iotools_read_fun(parallel = 3),
  times = 20
)
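The right number of processes depends on the machine running the code. As a rough sketch rather than part of the exercise, you could size the parallel argument from the machine's core count, leaving one core free; detectCores() comes from base R's parallel package, and on Windows the extra processes would simply be ignored, as noted above.

# Choose a process count from the machine's core count, keeping one core free
library(parallel)
n_proc <- max(1, detectCores() - 1)

# Compare a single process against the machine-sized setting
microbenchmark(
  iotools_read_fun(parallel = 1),
  iotools_read_fun(parallel = n_proc),
  times = 20
)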