Reading a big.matrix object

In this exercise, you'll create your first file-backed big.matrix object using the read.big.matrix() function. The function is meant to look similar to read.table() but, in addition, it needs to know what type of numeric values you want to read ("char", "short", "integer", "double"), the name of the file that will hold the matrix's data (the backing file), and the name of the file that will hold information about the matrix (the descriptor file). The result is a file on disk holding the values read in, along with a descriptor file that holds extra information (like the number of columns and rows) about the resulting big.matrix object.

* Load the bigmemory package.
* Use the read.big.matrix() function to read a file called "mortgage-sample.csv", which contains a header and is composed of integer values. In addition:
  * Create a backing file called "mortgage-sample.bin", and
  * A descriptor file called "mortgage-sample.desc".
* Find the dimensions of x using the dim() function.

    # Load the bigmemory package
    library(bigmemory)

    # Create the big.matrix object: x
    x <- read.big.matrix("mortgage-sample.csv", header = TRUE,
                         type = "integer",
                         backingfile = "mortgage-sample.bin",
                         descriptorfile = "mortgage-sample.desc")

    # Find the dimensions of x
    dim(x)

----------------------------------------------------------------------------------------------------

Attaching a big.matrix object

Now that the big.matrix object is on the disk, we can use the information stored in the descriptor file to make it available almost instantly during an R session. This means that you don't have to reimport the data set, which takes more time for larger files. You can simply point the bigmemory package at the existing structures on the disk and begin accessing data without the wait.

The big.matrix object x is available in your workspace.

* Create a new variable mort that points to x by attaching the "mortgage-sample.desc" file using the attach.big.matrix() function.
* Verify that the dimensions of mort are the same as in the last exercise.
* Call head() on mort.

    # Attach mortgage-sample.desc
    mort <- attach.big.matrix("mortgage-sample.desc")

    # Find the dimensions of mort
    dim(mort)

    # Look at the first 6 rows of mort
    head(mort)

----------------------------------------------------------------------------------------------------

Creating tables with big.matrix objects

A final advantage of using big.matrix is that if you know how to use R's matrices, then you know how to use a big.matrix. You can subset columns and rows just as you would a regular matrix, using a numeric or character vector, and the object returned is an R matrix. Likewise, assignments work the same as with R matrices; once made, they are stored on disk and can be used in the current and future R sessions.

One thing to remember is that $ is not valid for getting a column of either a matrix or a big.matrix.

* Create a new variable mort by attaching the "mortgage-sample.desc" file.
* Look at the first 3 rows of mort.
* Create a table of the number of mortgages for each year in the data set. The column name in the data set is "year".

    # Create mort
    mort <- attach.big.matrix("mortgage-sample.desc")

    # Look at the first 3 rows
    mort[1:3, ]

    # Create a table of the number of mortgages for each year in the data set
    table(mort[, "year"])

Good. Don't forget that this is only a sample of the entire data set, so the values are proportional to the actual total number of mortgages. Does it seem strange that some years had proportionally more total mortgages?

----------------------------------------------------------------------------------------------------

Data summary using bigsummary

Now that you know how to import and attach a big.matrix object, you can start exploring the data stored in this object. As mentioned before, there is a whole suite of packages designed to explore and analyze data stored as a big.matrix object.
In this exercise, you will use the biganalytics package to create summaries.

The reference object mort from the previous exercise is available in your workspace.

* Load the biganalytics package.
* Use the colmean() function to get the column means of mort.
* Use biganalytics' summary() function to get a summary of the variables.

    # Load the biganalytics package
    library(biganalytics)

    # Get the column means of mort
    colmean(mort)

    # Use biganalytics' summary function to get a summary of the data
    summary(mort)

Well done! Some categorical variables are already encoded with another value, so there are no NAs listed. In a few sections, we'll go through how to fix this.

----------------------------------------------------------------------------------------------------

Copying matrices and big matrices

If you want to copy a big.matrix object, then you need to use the deepcopy() function. This can be useful, especially if you want to create smaller big.matrix objects. In this exercise, you'll copy a big.matrix object and show the reference behavior for these types of objects.

The big.matrix object mort is available in your workspace.

* Create a new variable, first_three, which is an explicit copy of mort, but consists of only the first three columns.
* Set another variable, first_three_2, to first_three.
* Set the value in the first row and first column of first_three to NA.
* Verify the change shows up in first_three_2 but not in mort.

    # Use deepcopy() to create first_three
    first_three <- deepcopy(mort, cols = 1:3,
                            backingfile = "first_three.bin",
                            descriptorfile = "first_three.desc")

    # Set first_three_2 equal to first_three
    first_three_2 <- first_three

    # Set the value in the first row and first column of first_three to NA
    first_three[1, 1] <- NA

    # Verify the change shows up in first_three_2
    first_three_2[1, 1]

    # but not in mort
    mort[1, 1]

Great! You know the basics of loading, attaching, subsetting, and copying big.matrix objects.
In the next section we'll explore and begin analyzing the data set.
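
The exercises above depend on "mortgage-sample.csv", which may not be at hand. As a rough recap, the sketch below runs the same import, attach, tabulate, and copy steps on a tiny made-up data set. It assumes the bigmemory package is installed; the toy data, the file names ("toy.csv", "toy.bin", "toy.desc", "first_col.bin", "first_col.desc"), and the variable names are hypothetical and chosen only for illustration.

```r
library(bigmemory)

# Work in a fresh temporary directory so the backing files created here
# cannot clash with existing ones (the directory name is arbitrary).
work_dir <- tempfile("bigmemory-demo-")
dir.create(work_dir)
old_dir <- setwd(work_dir)

# Hypothetical toy data standing in for "mortgage-sample.csv".
toy <- data.frame(year   = c(2008L, 2008L, 2009L, 2009L, 2009L, 2010L),
                  amount = c(100L, 150L, 120L, 90L, 200L, 175L))
write.csv(toy, "toy.csv", row.names = FALSE, quote = FALSE)

# Import once: this creates toy.bin (the data) and toy.desc (the descriptor).
x <- read.big.matrix("toy.csv", header = TRUE, type = "integer",
                     backingfile = "toy.bin",
                     descriptorfile = "toy.desc")
print(dim(x))

# Later, attach through the descriptor file instead of re-importing the CSV.
toy_bm <- attach.big.matrix("toy.desc")

# Subsetting by column name returns ordinary R data that table() can count.
year_counts <- table(toy_bm[, "year"])
print(year_counts)

# deepcopy() creates an independent copy; plain assignment only creates
# a second reference to the same underlying data.
first_col <- deepcopy(toy_bm, cols = 1,
                      backingfile = "first_col.bin",
                      descriptorfile = "first_col.desc")
alias <- first_col
first_col[1, 1] <- NA_integer_

print(alias[1, 1])   # the alias reflects the change made through first_col
print(toy_bm[1, 1])  # the matrix that was deep-copied from is unaffected

# Restore the original working directory.
setwd(old_dir)
```

The sketch writes NA_integer_ rather than a bare NA only to match the matrix's integer type explicitly; the exercise's plain NA assignment expresses the same idea.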