Foldable operations (I)

An operation that gives the same answer whether you apply it to an entire data set, or to chunks of the data set and then to the results from those chunks, is sometimes called foldable. The max() and min() operations are examples of this. Here, we have defined a foldable version of the range() function that takes either a vector or a list of vectors. Verify that the function works by testing it on the mortgage data set.

* Verify that foldable_range() works on the "record_number" column of the mort data set.

foldable_range <- function(x) {
  if (is.list(x)) {
    # If x is a list, reduce it by taking the min and max across its elements
    c(Reduce(min, x), Reduce(max, x))
  } else {
    # Otherwise, assume it's a vector and find its range
    range(x)
  }
}

# Verify that foldable_range() works on the record_number column
foldable_range(mort[, "record_number"])

----------------------------------------------------------------------------------------------------

Foldable operations (II)

Now you'll use the function on partitions of the data set. You should realize that by performing this operation in pieces and then aggregating, you don't need to have all of the data in a variable at once. This point isn't that important with small data sets, like the mortgage sample data, but it is for large data sets. The foldable_range() function is available in your workspace.

* Split the rows of mort by the "year" column.
* Use foldable_range() to get the range of the "record_number" column chunked by year.

# Split the row indices of the mortgage data by year
spl <- split(1:nrow(mort), mort[, "year"])

# Use foldable_range() to get the range of the record numbers, chunked by year
foldable_range(Map(function(s) foldable_range(mort[s, "record_number"]), spl))
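Not every summary statistic is foldable in this direct way. The mean of chunk means, for example, is not the overall mean when the chunks have different sizes, but the operation becomes foldable if each chunk reports its sum and count instead. The sketch below illustrates the idea on made-up toy data; foldable_mean() and the vector x are not part of the exercises.

# A sketch of a foldable mean: each chunk reports its sum and its length,
# and the partial results are combined at the end.
foldable_mean <- function(x) {
  if (is.list(x)) {
    # x is a list of c(sum, count) pairs produced from the chunks
    totals <- Reduce(`+`, x)
    totals[1] / totals[2]
  } else {
    # x is a vector: return this chunk's partial result
    c(sum(x), length(x))
  }
}

# Toy data split into unevenly sized chunks
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
chunks <- list(x[1:3], x[4:8])

# The chunked computation agrees with the mean of the whole vector
mean(x)
foldable_mean(Map(foldable_mean, chunks))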
----------------------------------------------------------------------------------------------------

Compare read.delim() and read.delim.raw()

When processing a sequence of contiguous chunks of data on a hard drive, iotools can turn a raw object into a data.frame or matrix while, at the same time, retrieving the next chunk of data. These optimizations allow iotools to process very large files quickly.

* Time the reading of a file using read.delim() five times.
* Time the reading of a file using read.delim.raw() five times.

# Load the iotools and microbenchmark packages
library(iotools)
library(microbenchmark)

# Time the reading of the file with both functions
microbenchmark(
  # Read the file using read.delim()
  read.delim("mortgage-sample.csv", header = FALSE, sep = ","),
  # Read the file using read.delim.raw()
  read.delim.raw("mortgage-sample.csv", header = FALSE, sep = ","),
  times = 5
)

----------------------------------------------------------------------------------------------------

Reading raw data and turning it into a data structure

As mentioned before, part of what makes iotools fast is that it separates reading data from the hard drive from converting the binary data into a data.frame or matrix. Data in their binary format are copied from the hard drive into memory as raw objects. These raw objects are then passed to optimized functions that turn them into data.frame or matrix objects. In this exercise, you'll learn how to separate reading data from the disk (using the readAsRaw() function) from converting raw binary data into a matrix or data.frame (using the mstrsplit() and dstrsplit() functions).

* Read "mortgage-sample.csv" as a raw vector.
* Convert the raw vector contents to a matrix of integers.
* Convert the raw vector contents to a data.frame with 16 integer columns.

# Read mortgage-sample.csv as a raw vector
raw_file_content <- readAsRaw("mortgage-sample.csv")

# Convert the raw vector contents to a matrix, skipping the header line
mort_mat <- mstrsplit(raw_file_content, sep = ",", type = "integer", skip = 1)

# Look at the first 6 rows
head(mort_mat)

# Convert the raw file contents to a data.frame with 16 integer columns
mort_df <- dstrsplit(raw_file_content, sep = ",", col_types = rep("integer", 16), skip = 1)

# Look at the first 6 rows
head(mort_df)

----------------------------------------------------------------------------------------------------

Reading chunks in as a matrix

In this exercise, you'll write a scalable table() function counting the number of urban and rural borrowers in the mortgage data set using chunk.apply(). By default, chunk.apply() aggregates the processed data using the rbind() function. This means that you can create a table from each of the chunks and then add up the rows of the resulting matrix to get the total counts for the table. We have created a file connection fc to the "mortgage-sample.csv" file and read in the first line to get rid of the header.

* In the function make_table(), read each chunk as a matrix.
* Call chunk.apply() to read in the data as chunks.
* Print counts.
* Get the total counts for each column by adding up all the rows.

make_table <- function(chunk) {
  # Read each chunk as a matrix
  x <- mstrsplit(chunk, type = "integer", sep = ",")
  # Create a table of the urban and rural borrowers (column 3) for this chunk
  table(x[, 3])
}

# Create a file connection to mortgage-sample.csv
fc <- file("mortgage-sample.csv", "rb")

# Read the first line to get rid of the header
readLines(fc, n = 1)

# Read the data in chunks
counts <- chunk.apply(fc, make_table, CH.MAX.SIZE = 1e5)

# Close the file connection
close(fc)

# Print counts
counts

# Sum up the chunks
colSums(counts)
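The same chunked pattern works for the foldable operations from earlier in the section. Below is a minimal sketch, not one of the exercises: each chunk returns its own range, chunk.apply() stacks those partial results with rbind(), and the overall range is folded from them. The use of the first column here is only an illustrative assumption about the file's layout.

# Per-chunk partial result: the range of the first column
chunk_range <- function(chunk) {
  x <- mstrsplit(chunk, type = "integer", sep = ",")
  range(x[, 1])
}

# Read the file in chunks, skipping the header line as before
fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)
partials <- chunk.apply(fc, chunk_range, CH.MAX.SIZE = 1e5)
close(fc)

# One row per chunk: column 1 holds the chunk minima, column 2 the chunk maxima
c(min(partials[, 1]), max(partials[, 2]))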
----------------------------------------------------------------------------------------------------

Reading chunks in as a data.frame

In the previous example, we read each chunk into the processing function as a matrix using mstrsplit(). This is fine when we are reading rectangular data where every column has the same type. When that's not the case, we might like to read the data in as a data.frame. This can be done either by reading a chunk in as a matrix and then converting it to a data.frame, or by using the dstrsplit() function directly. The code below relies on the col_names and msa_map objects from your workspace, which hold the column names and a text label for each MSA code.

* In the function make_msa_table(), read each chunk as a data frame.
* Call chunk.apply() to read in the data as chunks.
* Get the total counts for each column by adding up all the rows.

# Define the function to apply to each chunk
make_msa_table <- function(chunk) {
  # Read each chunk as a data frame
  x <- dstrsplit(chunk, col_types = rep("integer", length(col_names)), sep = ",")
  # Set the column names of the data frame that's been read
  colnames(x) <- col_names
  # Create a new column, msa_pretty, with a string description of where the borrower lives
  x$msa_pretty <- msa_map[x$msa + 1]
  # Create a table from the msa_pretty column
  table(x$msa_pretty)
}

# Create a file connection to mortgage-sample.csv
fc <- file("mortgage-sample.csv", "rb")

# Read the first line to get rid of the header
readLines(fc, n = 1)

# Read the data in chunks
counts <- chunk.apply(fc, make_msa_table, CH.MAX.SIZE = 1e5)

# Close the file connection
close(fc)

# Aggregate the counts as before
colSums(counts)

----------------------------------------------------------------------------------------------------

Parallelizing calls to chunk.apply

The chunk.apply() function can also make use of parallel processes to work through data more quickly. When the parallel parameter is set to a value greater than one on Linux and Unix machines (including macOS), multiple processes read and process data at the same time, thereby reducing the execution time. On Windows the parallel parameter is ignored.

* Benchmark the function iotools_read_fun(), first with 1 process and then with 3 parallel processes.

iotools_read_fun <- function(parallel) {
  fc <- file("mortgage-sample.csv", "rb")
  readLines(fc, n = 1)
  chunk.apply(fc, make_msa_table, CH.MAX.SIZE = 1e5, parallel = parallel)
  close(fc)
}

# Benchmark the new function
microbenchmark(
  # Use one process
  iotools_read_fun(parallel = 1),
  # Use three processes
  iotools_read_fun(parallel = 3),
  times = 20
)
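The right number of processes depends on the machine running the code. As a rough sketch rather than part of the exercise, you could size the parallel argument from the machine's core count, leaving one core free; detectCores() comes from base R's parallel package, and on Windows the extra processes would simply be ignored, as noted above.

# Choose a process count from the machine's core count, keeping one core free
library(parallel)
n_proc <- max(1, detectCores() - 1)

# Compare a single process against the machine-sized setting
microbenchmark(
  iotools_read_fun(parallel = 1),
  iotools_read_fun(parallel = n_proc),
  times = 20
)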