Machine Learning with caret in R
Max Kuhn - DataCamp
Course Description
Machine learning is the study and application of algorithms that learn
from and make predictions on data. From search results to self-driving
cars, it has manifested itself in all areas of our lives and is one of
the most exciting and fast growing fields of research in the world of
data science. This course teaches the big ideas in machine learning: how
to build and evaluate predictive models, how to tune them for optimal
performance, how to preprocess data for better results, and much more.
The popular caret
R package, which provides a consistent
interface to all of R’s most powerful machine learning facilities, is
used throughout the course.
1 Regression models: fitting them and evaluating their performance
In the first chapter of this course, you’ll fit regression models with train()
and evaluate their out-of-sample performance using cross-validation and root-mean-square error (RMSE).
1.1 Welcome to the Toolbox
1.1.1 In-sample RMSE for linear regression
RMSE is commonly calculated in-sample on your training set. What’s a potential drawback to calculating training set error?
- There’s no potential drawback to calculating training set error, but you should calculate R² instead of RMSE.
- You have no idea how well your model generalizes to new data (i.e. overfitting).
- You should manually inspect your model to validate its coefficients and calculate RMSE.
1.1.2 In-sample RMSE for linear regression on diamonds
As you saw in the video, included in the course is the diamonds
dataset, which is a classic dataset from the ggplot2
package. The dataset contains physical attributes of diamonds as well
as the price they sold for. One interesting modeling challenge is
predicting diamond price based on their attributes using something like a
linear regression.
Recall that to fit a linear regression, you use the lm()
function in the following format:
mod <- lm(y ~ x, my_data)
To make predictions using mod
on the original data, you call the predict()
function:
pred <- predict(mod, my_data)
Fit a linear model on the diamonds dataset predicting price using all other variables as predictors (i.e. price ~ .). Save the result to model.
library(tidyverse)
data("diamonds")
# Fit lm model: model
model <- lm(price ~ ., diamonds)
Make predictions using model on the full original dataset and save the result to p.
# Predict on full data: p
p <- predict(model, diamonds)
Compute the errors (predictions minus actual prices) and save the result to error.
# Compute errors: error
error <- p - diamonds[["price"]]
# Calculate RMSE
sqrt(mean(error ^ 2))
## [1] 1129.843
Great work! Now you know how to manually calculate RMSE for your model’s predictions!
1.2 Out-of-sample error measures
1.2.1 Out-of-sample RMSE for linear regression
What is the advantage of using a train/test split rather than just validating your model in-sample on the training set?
- It takes less time to calculate error on the test set, since it is smaller than the training set.
- There is no advantage to using a test set. You can just use adjusted R² on your training set.
- It gives you an estimate of how well your model performs on new data.
1.2.2 Randomly order the data frame
One way you can take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not retained in the samples we take for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.
First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:
set.seed(42)
Next, you use the sample()
function to shuffle the row indices of the diamonds
dataset. You can later use these indices to reorder the dataset.
rows <- sample(nrow(diamonds))
Finally, you can use this random vector to reorder the diamonds dataset:
diamonds <- diamonds[rows, ]
# Set seed
set.seed(42)
Use the sample() function to shuffle the row indices of the diamonds dataset. Save the result to rows.
# Shuffle row indices: rows
rows <- sample(nrow(diamonds))
Use this random vector to reorder the diamonds data frame, assigning to shuffled_diamonds.
# Randomly order data: shuffled_diamonds
shuffled_diamonds <- diamonds[rows, ]
Great job! Randomly ordering your dataset is important for many machine learning methods.
1.2.3 Try an 80/20 split
Now that your dataset is randomly ordered, you can split the first 80% of it into a training set, and the last 20% into a test set. You can do this by choosing a split point approximately 80% of the way through your data:
split <- round(nrow(mydata) * 0.80)
You can then use this point to break off the first 80% of the dataset as a training set:
mydata[1:split, ]
And then you can use that same point to determine the test set:
mydata[(split + 1):nrow(mydata), ]
Choose a row index to split on so that the split point is approximately 80% of the way through the diamonds dataset. Call this index split.
# Determine row to split on: split
split <- round(nrow(diamonds) * 0.80)
Create a training set called train using that index.
# Create train
train <- diamonds[1:split, ]
Create a test set called test using that index.
# Create test
test <- diamonds[(split + 1):nrow(diamonds), ]
Well done! Because you already randomly ordered your dataset, it’s easy to split off a random test set.
1.2.4 Predict on test set
Now that you have a randomly split training set and test set, you can use the lm()
function as you did in the first exercise to fit a model to your
training set, rather than the entire dataset. Recall that you can use
the formula interface to the linear regression function to fit a model
with a specified target variable using all other variables in the
dataset as predictors:
mod <- lm(y ~ ., training_data)
You can use the predict()
function to make predictions from
that model on new data. The new dataset must have all of the columns
from the training data, but they can be in a different order with
different values. Here, rather than re-predicting on the training set,
you can predict on the test set, which you did not use for training the
model. This will allow you to determine the out-of-sample error for the
model in the next exercise:
p <- predict(model, new_data)
Fit an lm() model called model to predict price using all other variables as covariates. Be sure to use the training set, train.
# Fit lm model on train: model
model <- lm(price ~ ., train)
Predict on the test set, test, using predict(). Store these values in a vector called p.
# Predict on test: p
p <- predict(model, test)
Excellent work! R makes it very easy to predict with a model on new data.
1.2.5 Calculate test set RMSE by hand
Now that you have predictions on the test set, you can use these predictions to calculate an error metric (in this case RMSE) on the test set and see how the model performs out-of-sample, rather than in-sample as you did in the first exercise. You first do this by calculating the errors between the predicted diamond prices and the actual diamond prices by subtracting the predictions from the actual values.
Once you have an error vector, calculating RMSE is as simple as squaring it, taking the mean, then taking the square root:
sqrt(mean(error^2))
test, model, and p are loaded in your workspace.
Compute the errors (predictions minus actual prices) and save the result to error.
# Compute errors: error
error <- p - test[["price"]]
# Calculate RMSE
sqrt(mean(error^2))
## [1] 796.8922
Good Job! Calculating RMSE on a test set is exactly the same as calculating it on a training set.
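As an aside, once caret is loaded it also provides helpers for this calculation; a minimal sketch, assuming the p and test objects from the exercise above are still in your workspace:
# Same RMSE via caret helpers (equivalent to sqrt(mean(error^2)))
library(caret)
RMSE(p, test[["price"]])
# postResample() returns RMSE, Rsquared, and MAE in one call
postResample(pred = p, obs = test[["price"]])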
1.2.6 Comparing out-of-sample RMSE to in-sample RMSE
Why is the test set RMSE higher than the training set RMSE?
- Because you overfit the training set and the test set contains data the model hasn’t seen before.
- Because you should not use a test set at all and instead just look at error on the training set.
- Because the test set has a smaller sample size than the training set and thus the mean error is lower.
1.3 Cross-validation
1.3.1 Advantage of cross-validation
What is the advantage of cross-validation over a single train/test split?
- There is no advantage to cross-validation, just as there is no advantage to a single train/test split. You should be validating your models in-sample with a metric like adjusted R².
- You can pick the best test set to minimize the reported RMSE of your model.
- It gives you multiple estimates of out-of-sample error, rather than a single estimate.
1.3.2 10-fold cross-validation
As you saw in the video, a better approach to validating models is to
use multiple systematic test sets, rather than a single random
train/test split. Fortunately, the caret
package makes this very easy to do:
model <- train(y ~ ., my_data)
caret
supports many types of cross-validation, and you can
specify which type of cross-validation and the number of
cross-validation folds with the trainControl()
function, which you pass to the trControl
argument in train()
:
model <- train(
y ~ .,
my_data,
method = "lm",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
)
)
It’s important to note that you pass the method for modeling to the main train()
function and the method for cross-validation to the trainControl()
function.
Fit a linear regression to model price using all other variables in the diamonds dataset as predictors. Use the train() function and 10-fold cross-validation. (Note that we’ve taken a subset of the full diamonds dataset to speed up this operation, but it’s still named diamonds.)
library(caret)
# Fit lm model using 10-fold CV: model
model <- train(
price ~ .,
diamonds,
method = "lm",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
)
)
## + Fold01: intercept=TRUE
## - Fold01: intercept=TRUE
## + Fold02: intercept=TRUE
## - Fold02: intercept=TRUE
## + Fold03: intercept=TRUE
## - Fold03: intercept=TRUE
## + Fold04: intercept=TRUE
## - Fold04: intercept=TRUE
## + Fold05: intercept=TRUE
## - Fold05: intercept=TRUE
## + Fold06: intercept=TRUE
## - Fold06: intercept=TRUE
## + Fold07: intercept=TRUE
## - Fold07: intercept=TRUE
## + Fold08: intercept=TRUE
## - Fold08: intercept=TRUE
## + Fold09: intercept=TRUE
## - Fold09: intercept=TRUE
## + Fold10: intercept=TRUE
## - Fold10: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
# Print model to console
model
## Linear Regression
##
## 53940 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48547, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1131.015 0.9196398 740.6117
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Good job! Caret does all the work of splitting test sets and calculating RMSE for you!
1.3.3 5-fold cross-validation
In this course, you will use a wide variety of datasets to explore the full flexibility of the caret
package. Here, you will use the famous Boston housing dataset, where
the goal is to predict median home values in various Boston suburbs.
You can use exactly the same code as in the previous exercise, but change the dataset used by the model:
model <- train(
medv ~ .,
Boston, # <- new!
method = "lm",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
)
)
Next, you can reduce the number of cross-validation folds from 10 to 5 using the number
argument to the trainControl()
argument:
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
Fit an lm() model to the Boston housing dataset, such that medv is the response variable and all other variables are explanatory variables.
library(mlbench)
data(BostonHousing)
Boston <- BostonHousing
# Fit lm model using 5-fold CV: model
model <- train(
medv ~ .,
Boston,
method = "lm",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: intercept=TRUE
## - Fold1: intercept=TRUE
## + Fold2: intercept=TRUE
## - Fold2: intercept=TRUE
## + Fold3: intercept=TRUE
## - Fold3: intercept=TRUE
## + Fold4: intercept=TRUE
## - Fold4: intercept=TRUE
## + Fold5: intercept=TRUE
## - Fold5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
# Print model to console
model
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 405, 405, 406, 403, 405
## Resampling results:
##
## RMSE Rsquared MAE
## 4.860247 0.7209221 3.398114
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Great work! Caret makes it easy to try different validation schemes with the same model and compare RMSE.
1.3.4 5 x 5-fold cross-validation
You can do more than just one iteration of cross-validation: repeating the entire cross-validation procedure gives you a better estimate of the test-set error. This takes longer, but gives you many more out-of-sample datasets to look at and much more precise assessments of how well the model performs.
One of the awesome things about the train()
function in caret
is how easy it is to run very different models or methods of
cross-validation just by tweaking a few simple arguments to the function
call. For example, you could repeat your entire cross-validation
procedure 5 times for greater confidence in your estimates of the
model’s out-of-sample accuracy, e.g.:
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = TRUE
)
Re-fit the linear regression model to the Boston housing dataset, this time using 5 repeats of 5-fold cross-validation.
# Fit lm model using 5 x 5-fold CV: model
model <- train(
medv ~ .,
Boston,
method = "lm",
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = TRUE
)
)
## + Fold1.Rep1: intercept=TRUE
## - Fold1.Rep1: intercept=TRUE
## + Fold2.Rep1: intercept=TRUE
## - Fold2.Rep1: intercept=TRUE
## + Fold3.Rep1: intercept=TRUE
## - Fold3.Rep1: intercept=TRUE
## + Fold4.Rep1: intercept=TRUE
## - Fold4.Rep1: intercept=TRUE
## + Fold5.Rep1: intercept=TRUE
## - Fold5.Rep1: intercept=TRUE
## + Fold1.Rep2: intercept=TRUE
## - Fold1.Rep2: intercept=TRUE
## + Fold2.Rep2: intercept=TRUE
## - Fold2.Rep2: intercept=TRUE
## + Fold3.Rep2: intercept=TRUE
## - Fold3.Rep2: intercept=TRUE
## + Fold4.Rep2: intercept=TRUE
## - Fold4.Rep2: intercept=TRUE
## + Fold5.Rep2: intercept=TRUE
## - Fold5.Rep2: intercept=TRUE
## + Fold1.Rep3: intercept=TRUE
## - Fold1.Rep3: intercept=TRUE
## + Fold2.Rep3: intercept=TRUE
## - Fold2.Rep3: intercept=TRUE
## + Fold3.Rep3: intercept=TRUE
## - Fold3.Rep3: intercept=TRUE
## + Fold4.Rep3: intercept=TRUE
## - Fold4.Rep3: intercept=TRUE
## + Fold5.Rep3: intercept=TRUE
## - Fold5.Rep3: intercept=TRUE
## + Fold1.Rep4: intercept=TRUE
## - Fold1.Rep4: intercept=TRUE
## + Fold2.Rep4: intercept=TRUE
## - Fold2.Rep4: intercept=TRUE
## + Fold3.Rep4: intercept=TRUE
## - Fold3.Rep4: intercept=TRUE
## + Fold4.Rep4: intercept=TRUE
## - Fold4.Rep4: intercept=TRUE
## + Fold5.Rep4: intercept=TRUE
## - Fold5.Rep4: intercept=TRUE
## + Fold1.Rep5: intercept=TRUE
## - Fold1.Rep5: intercept=TRUE
## + Fold2.Rep5: intercept=TRUE
## - Fold2.Rep5: intercept=TRUE
## + Fold3.Rep5: intercept=TRUE
## - Fold3.Rep5: intercept=TRUE
## + Fold4.Rep5: intercept=TRUE
## - Fold4.Rep5: intercept=TRUE
## + Fold5.Rep5: intercept=TRUE
## - Fold5.Rep5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
# Print model to console
model
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 405, 406, 405, 403, 405, 405, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.845724 0.7277269 3.402735
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Fantastic work! You can use caret to do some very complicated cross-validation schemes.
1.3.5 Making predictions on new data
Finally, the model you fit with the train()
function has the exact same predict()
interface as the linear regression models you fit earlier in this chapter.
After fitting a model with train()
, you can simply call predict()
with new data, e.g:
predict(my_model, new_data)
Use the predict()
function to make predictions with model
on the full Boston
housing dataset. Print the result to the console.
# Predict on full Boston dataset
predict(model, Boston)
## 1 2 3 4 5 6 7
## 30.0038434 25.0255624 30.5675967 28.6070365 27.9435242 25.2562845 23.0018083
## 8 9 10 11 12 13 14
## 19.5359884 11.5236369 18.9202621 18.9994965 21.5867957 20.9065215 19.5529028
## 15 16 17 18 19 20 21
## 19.2834821 19.2974832 20.5275098 16.9114013 16.1780111 18.4061360 12.5238575
## 22 23 24 25 26 27 28
## 17.6710367 15.8328813 13.8062853 15.6783383 13.3866856 15.4639765 14.7084743
## 29 30 31 32 33 34 35
## 19.5473729 20.8764282 11.4551176 18.0592329 8.8110574 14.2827581 13.7067589
## 36 37 38 39 40 41 42
## 23.8146353 22.3419371 23.1089114 22.9150261 31.3576257 34.2151023 28.0205641
## 43 44 45 46 47 48 49
## 25.2038663 24.6097927 22.9414918 22.0966982 20.4232003 18.0365509 9.1065538
## 50 51 52 53 54 55 56
## 17.2060775 21.2815254 23.9722228 27.6558508 24.0490181 15.3618477 31.1526495
## 57 58 59 60 61 62 63
## 24.8568698 33.1091981 21.7753799 21.0849356 17.8725804 18.5111021 23.9874286
## 64 65 66 67 68 69 70
## 22.5540887 23.3730864 30.3614836 25.5305651 21.1133856 17.4215379 20.7848363
## 71 72 73 74 75 76 77
## 25.2014886 21.7426577 24.5574496 24.0429571 25.5049972 23.9669302 22.9454540
## 78 79 80 81 82 83 84
## 23.3569982 21.2619827 22.4281737 28.4057697 26.9948609 26.0357630 25.0587348
## 85 86 87 88 89 90 91
## 24.7845667 27.7904920 22.1685342 25.8927642 30.6746183 30.8311062 27.1190194
## 92 93 94 95 96 97 98
## 27.4126673 28.9412276 29.0810555 27.0397736 28.6245995 24.7274498 35.7815952
## 99 100 101 102 103 104 105
## 35.1145459 32.2510280 24.5802202 25.5941347 19.7901368 20.3116713 21.4348259
## 106 107 108 109 110 111 112
## 18.5399401 17.1875599 20.7504903 22.6482911 19.7720367 20.6496586 26.5258674
## 113 114 115 116 117 118 119
## 20.7732364 20.7154831 25.1720888 20.4302559 23.3772463 23.6904326 20.3357836
## 120 121 122 123 124 125 126
## 20.7918087 21.9163207 22.4710778 20.5573856 16.3666198 20.5609982 22.4817845
## 127 128 129 130 131 132 133
## 14.6170663 15.1787668 18.9386859 14.0557329 20.0352740 19.4101340 20.0619157
## 134 135 136 137 138 139 140
## 15.7580767 13.2564524 17.2627773 15.8784188 19.3616395 13.8148390 16.4488147
## 141 142 143 144 145 146 147
## 13.5714193 3.9888551 14.5949548 12.1488148 8.7282236 12.0358534 15.8208206
## 148 149 150 151 152 153 154
## 8.5149902 9.7184414 14.8045137 20.8385815 18.3010117 20.1228256 17.2860189
## 155 156 157 158 159 160 161
## 22.3660023 20.1037592 13.6212589 33.2598270 29.0301727 25.5675277 32.7082767
## 162 163 164 165 166 167 168
## 36.7746701 40.5576584 41.8472817 24.7886738 25.3788924 37.2034745 23.0874875
## 169 170 171 172 173 174 175
## 26.4027396 26.6538211 22.5551466 24.2908281 22.9765722 29.0719431 26.5219434
## 176 177 178 179 180 181 182
## 30.7220906 25.6166931 29.1374098 31.4357197 32.9223157 34.7244046 27.7655211
## 183 184 185 186 187 188 189
## 33.8878732 30.9923804 22.7182001 24.7664781 35.8849723 33.4247672 32.4119915
## 190 191 192 193 194 195 196
## 34.5150995 30.7610949 30.2893414 32.9191871 32.1126077 31.5587100 40.8455572
## 197 198 199 200 201 202 203
## 36.1277008 32.6692081 34.7046912 30.0934516 30.6439391 29.2871950 37.0714839
## 204 205 206 207 208 209 210
## 42.0319312 43.1894984 22.6903480 23.6828471 17.8544721 23.4942899 17.0058772
## 211 212 213 214 215 216 217
## 22.3925110 17.0604275 22.7389292 25.2194255 11.1191674 24.5104915 26.6033477
## 218 219 220 221 222 223 224
## 28.3551871 24.9152546 29.6865277 33.1841975 23.7745666 32.1405196 29.7458199
## 225 226 227 228 229 230 231
## 38.3710245 39.8146187 37.5860575 32.3995325 35.4566524 31.2341151 24.4844923
## 232 233 234 235 236 237 238
## 33.2883729 38.0481048 37.1632863 31.7138352 25.2670557 30.1001074 32.7198716
## 239 240 241 242 243 244 245
## 28.4271706 28.4294068 27.2937594 23.7426248 24.1200789 27.4020841 16.3285756
## 246 247 248 249 250 251 252
## 13.3989126 20.0163878 19.8618443 21.2883131 24.0798915 24.2063355 25.0421582
## 253 254 255 256 257 258 259
## 24.9196401 29.9456337 23.9722832 21.6958089 37.5110924 43.3023904 36.4836142
## 260 261 262 263 264 265 266
## 34.9898859 34.8121151 37.1663133 40.9892850 34.4463409 35.8339755 28.2457430
## 267 268 269 270 271 272 273
## 31.2267359 40.8395575 39.3179239 25.7081791 22.3029553 27.2034097 28.5116947
## 274 275 276 277 278 279 280
## 35.4767660 36.1063916 33.7966827 35.6108586 34.8399338 30.3519266 35.3098070
## 281 282 283 284 285 286 287
## 38.7975697 34.3312319 40.3396307 44.6730834 31.5968909 27.3565923 20.1017415
## 288 289 290 291 292 293 294
## 27.0420667 27.2136458 26.9139584 33.4356331 34.4034963 31.8333982 25.8178324
## 295 296 297 298 299 300 301
## 24.4298235 28.4576434 27.3626700 19.5392876 29.1130984 31.9105461 30.7715945
## 302 303 304 305 306 307 308
## 28.9427587 28.8819102 32.7988723 33.2090546 30.7683179 35.5622686 32.7090512
## 309 310 311 312 313 314 315
## 28.6424424 23.5896583 18.5426690 26.8788984 23.2813398 25.5458025 25.4812006
## 316 317 318 319 320 321 322
## 20.5390990 17.6157257 18.3758169 24.2907028 21.3252904 24.8868224 24.8693728
## 323 324 325 326 327 328 329
## 22.8695245 19.4512379 25.1178340 24.6678691 23.6807618 19.3408962 21.1741811
## 330 331 332 333 334 335 336
## 24.2524907 21.5926089 19.9844661 23.3388800 22.1406069 21.5550993 20.6187291
## 337 338 339 340 341 342 343
## 20.1609718 19.2849039 22.1667232 21.2496577 21.4293931 30.3278880 22.0473498
## 344 345 346 347 348 349 350
## 27.7064791 28.5479412 16.5450112 14.7835964 25.2738008 27.5420512 22.1483756
## 351 352 353 354 355 356 357
## 20.4594409 20.5460542 16.8806383 25.4025351 14.3248663 16.5948846 19.6370469
## 358 359 360 361 362 363 364
## 22.7180661 22.2021889 19.2054806 22.6661611 18.9319262 18.2284680 20.2315081
## 365 366 367 368 369 370 371
## 37.4944739 14.2819073 15.5428625 10.8316232 23.8007290 32.6440736 34.6068404
## 372 373 374 375 376 377 378
## 24.9433133 25.9998091 6.1263250 0.7777981 25.3071306 17.7406106 20.2327441
## 379 380 381 382 383 384 385
## 15.8333130 16.8351259 14.3699483 18.4768283 13.4276828 13.0617751 3.2791812
## 386 387 388 389 390 391 392
## 8.0602217 6.1284220 5.6186481 6.4519857 14.2076474 17.2122518 17.2988727
## 393 394 395 396 397 398 399
## 9.8911664 20.2212419 17.9418118 20.3044578 19.2955908 16.3363278 6.5516232
## 400 401 402 403 404 405 406
## 10.8901678 11.8814587 17.8117451 18.2612659 12.9794878 7.3781636 8.2111586
## 407 408 409 410 411 412 413
## 8.0662619 19.9829479 13.7075637 19.8526845 15.2230830 16.9607198 1.7185181
## 414 415 416 417 418 419 420
## 11.8057839 -4.2813107 9.5837674 13.3666081 6.8956236 6.1477985 14.6066179
## 421 422 423 424 425 426 427
## 19.6000267 18.1242748 18.5217713 13.1752861 14.6261762 9.9237498 16.3459065
## 428 429 430 431 432 433 434
## 14.0751943 14.2575624 13.0423479 18.1595569 18.6955435 21.5272830 17.0314186
## 435 436 437 438 439 440 441
## 15.9609044 13.3614161 14.5207938 8.8197601 4.8675110 13.0659131 12.7060970
## 442 443 444 445 446 447 448
## 17.2955806 18.7404850 18.0590103 11.5147468 11.9740036 17.6834462 18.1269524
## 449 450 451 452 453 454 455
## 17.5183465 17.2274251 16.5227163 19.4129110 18.5821524 22.4894479 15.2800013
## 456 457 458 459 460 461 462
## 15.8208934 12.6872558 12.8763379 17.1866853 18.5124761 19.0486053 20.1720893
## 463 464 465 466 467 468 469
## 19.7740732 22.4294077 20.3191185 17.8861625 14.3747852 16.9477685 16.9840576
## 470 471 472 473 474 475 476
## 18.5883840 20.1671944 22.9771803 22.4558073 25.5782463 16.3914763 16.1114628
## 477 478 479 480 481 482 483
## 20.5348160 11.5427274 19.2049630 21.8627639 23.4687887 27.0988732 28.5699430
## 484 485 486 487 488 489 490
## 21.0839878 19.4551620 22.2222591 19.6559196 21.3253610 11.8558372 8.2238669
## 491 492 493 494 495 496 497
## 3.6639967 13.7590854 15.9311855 20.6266205 20.6124941 16.8854196 14.0132079
## 498 499 500 501 502 503 504
## 19.1085414 21.2980517 18.4549884 20.4687085 23.5333405 22.3757189 27.6274261
## 505 506
## 26.1279668 22.3442123
Awesome job! Predicting with a caret model is as easy as predicting with a regular model!
2 Classification models: fitting them and evaluating their performance
In this chapter, you’ll fit classification models with train()
and evaluate their out-of-sample performance using cross-validation and area under the curve (AUC).
2.1 Logistic regression on sonar
2.1.1 Why a train/test split?
What is the point of making a train/test split for binary classification problems?
- To make the problem harder for the model by reducing the dataset size.
- To evaluate your models out-of-sample, on new data.
- To reduce the dataset size, so your models fit faster.
- There is no real reason; it is no different than evaluating your models in-sample.
2.1.2 Try a 60/40 split
As you saw in the video, you’ll be working with the Sonar
dataset in this chapter, using a 60% training set and a 40% test set.
We’ll practice making a train/test split one more time, just to be sure
you have the hang of it. Recall that you can use the sample()
function to get a random permutation of the row indices in a dataset, to use when making train/test splits, e.g.:
n_obs <- nrow(my_data)
permuted_rows <- sample(n_obs)
And then use those row indices to randomly reorder the dataset, e.g.:
my_data <- my_data[permuted_rows, ]
Once your dataset is randomly ordered, you can split off the first 60% as a training set and the last 40% as a test set.
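As an aside (the exercise below sticks to the manual approach), caret also offers createDataPartition() for making stratified splits; a minimal sketch, using train_alt and test_alt as hypothetical names so they don’t clash with the exercise objects:
# Optional alternative: a stratified 60/40 split with caret's createDataPartition()
library(caret)
set.seed(42)
in_train <- createDataPartition(Sonar$Class, p = 0.6, list = FALSE)
train_alt <- Sonar[in_train, ]   # ~60% of rows, class balance preserved
test_alt <- Sonar[-in_train, ]   # remaining ~40%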
Get the number of observations (rows) in Sonar, assigning to n_obs.
data(Sonar)
# Get the number of observations
n_obs <- nrow(Sonar)
Shuffle the row indices of Sonar and store the result in permuted_rows.
# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)
Use permuted_rows to randomly reorder the rows of Sonar, saving as Sonar_shuffled.
# Randomly order data: Sonar_shuffled
Sonar_shuffled <- Sonar[permuted_rows, ]
Identify the proper row to split on for a 60/40 split, and store this index as split.
# Identify row to split on: split
split <- round(n_obs * 0.6)
Save the first 60% of Sonar_shuffled as a training set, train.
# Create train
train <- Sonar_shuffled[1:split, ]
Save the last 40% of Sonar_shuffled as the test set, test.
# Create test
test <- Sonar_shuffled[(split + 1):n_obs, ]
Excellent work! Randomly shuffling your data makes it easy to manually create a train/test split.
2.1.3 Fit a logistic regression model
Once you have your random training and test sets you can fit a logistic regression model to your training set using the glm()
function. glm()
is a more advanced version of lm()
that allows for more varied types of regression models, aside from plain vanilla ordinary least squares regression.
Be sure to pass the argument family = “binomial”
to glm()
to specify that you want to do logistic (rather than linear) regression. For example:
glm(Target ~ ., family = "binomial", dataset)
Don’t worry about warnings like glm.fit: algorithm did not converge
or glm.fit: fitted probabilities numerically 0 or 1 occurred
. These are common on smaller datasets and usually don’t cause any issues. They typically mean your dataset is perfectly separable, which can cause problems for the math behind the model, but R’s glm()
function is almost always robust enough to handle this case with no problems.
Once you have a glm()
model fit to your dataset, you can predict the outcome (e.g. rock or mine) on the test
set using the predict()
function with the argument type = “response”
:
predict(my_model, test, type = "response")
Fit a logistic regression called model to predict Class using all other variables as predictors. Use the training set for Sonar.
# Fit glm model: model
model <- glm(Class ~ ., family = "binomial", train)
Predict on the test set using that model. Call the result p, like you’ve done before.
# Predict on test: p
p <- predict(model, test, type = "response")
Great work! Manually fitting a glm model in R is very similar to fitting an lm model.
2.2 Confusion matrix
2.2.1 Confusion matrix takeaways
What information does a confusion matrix provide?
- True positive rates
- True negative rates
- False positive rates
- False negative rates
- All of the above
2.2.2 Calculate a confusion matrix
As you saw in the video, a confusion matrix is a very useful tool for calibrating the output of a model and examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative).
Before you make your confusion matrix, you need to “cut” your predicted
probabilities at a given threshold to turn probabilities into a factor
of class predictions. Combine ifelse()
with factor()
as follows:
pos_or_neg <- ifelse(probability_prediction > threshold, positive_class, negative_class)
p_class <- factor(pos_or_neg, levels = levels(test_values))
confusionMatrix()
in caret
improves on table()
from base R by adding lots of useful ancillary statistics in addition
to the base rates in the table. You can calculate the confusion matrix
(and the associated statistics) using the predicted outcomes as well as
the actual outcomes, e.g.:
confusionMatrix(p_class, test_values)
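For comparison, the raw counts alone can be produced with base R’s table(); a minimal sketch, assuming the p_class factor and test set created in the exercise below:
# Base R cross-tabulation: counts only, no ancillary statistics
table(predicted = p_class, actual = test[["Class"]])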
Use ifelse() to create a character vector, m_or_r, that is the positive class, “M”, when p is greater than 0.5, and the negative class, “R”, otherwise.
# If p exceeds threshold of 0.5, M else R: m_or_r
m_or_r <- ifelse(p > 0.5, "M", "R")
Convert m_or_r to be a factor, p_class, with levels the same as those of test[[“Class”]].
# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))
Make a confusion matrix with confusionMatrix(), passing p_class and the “Class” column from the test dataset.
# Create confusion matrix
confusionMatrix(p_class, test[["Class"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 7 31
## R 32 13
##
## Accuracy : 0.241
## 95% CI : (0.1538, 0.3473)
## No Information Rate : 0.5301
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.5258
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.17949
## Specificity : 0.29545
## Pos Pred Value : 0.18421
## Neg Pred Value : 0.28889
## Prevalence : 0.46988
## Detection Rate : 0.08434
## Detection Prevalence : 0.45783
## Balanced Accuracy : 0.23747
##
## 'Positive' Class : M
##
Great work! The confusionMatrix function is a very easy way to get a detailed summary of your model’s accuracy.
2.2.3 Calculating accuracy
Use confusionMatrix(p_class, test[[“Class”]])
to calculate a confusion matrix on the test set.
What is the test set accuracy of this model (rounded to the nearest percent)?
- 58%
- 83%
- 70%
- 51%
Nice one! This is the model’s accuracy.
2.2.4 Calculating true positive rate
Use confusionMatrix(p_class, test[[“Class”]])
to calculate a confusion matrix on the test set.
What is the test set true positive rate (or sensitivity) of this model (rounded to the nearest percent)?
- 58%
- 83%
- 70%
- 51%
Nice one!
2.2.5 Calculating true negative rate
Use confusionMatrix(p_class, test[[“Class”]])
to calculate a confusion matrix on the test set.
What is the test set true negative rate (or specificity) of this model (rounded to the nearest percent)?
- 58%
- 83%
- 70%
- 51%
Good job!
2.3 Class probabilities and predictions
2.3.1 Probabilities and classes
What’s the relationship between the predicted probabilities and the predicted classes?
- You determine the predicted probabilities by looking at the average accuracy of the predicted classes.
- There is no relationship; they’re completely different things.
- Predicted classes are based off of predicted probabilities plus a classification threshold.
2.3.2 Try another threshold
In the previous exercises, you used a threshold of 0.50 to cut your predicted probabilities to make class predictions (rock vs mine). However, this classification threshold does not always align with the goals for a given modeling problem.
For example, pretend you want to identify the objects you are really certain are mines. In this case, you might want to use a probability threshold of 0.90 to get fewer predicted mines, but with greater confidence in each prediction.
The code pattern for cutting probabilities into predicted classes, then calculating a confusion matrix, was shown in section 2.2.2 above.
Use ifelse() to create a character vector, m_or_r, that is the positive class, “M”, when p is greater than 0.9, and the negative class, “R”, otherwise.
# If p exceeds threshold of 0.9, M else R: m_or_r
m_or_r <- ifelse(p > 0.9, "M", "R")
Convert m_or_r to be a factor, p_class, with levels the same as those of test[[“Class”]].
# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))
Make a confusion matrix with confusionMatrix(), passing p_class and the “Class” column from the test dataset.
# Create confusion matrix
confusionMatrix(p_class, test[["Class"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 7 31
## R 32 13
##
## Accuracy : 0.241
## 95% CI : (0.1538, 0.3473)
## No Information Rate : 0.5301
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.5258
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.17949
## Specificity : 0.29545
## Pos Pred Value : 0.18421
## Neg Pred Value : 0.28889
## Prevalence : 0.46988
## Detection Rate : 0.08434
## Detection Prevalence : 0.45783
## Balanced Accuracy : 0.23747
##
## 'Positive' Class : M
##
Amazing! A higher threshold normally yields fewer predicted mines, each held with greater confidence. In this run the confusion matrix is unchanged from the 0.50 threshold, likely because the model’s predicted probabilities sit very close to 0 or 1 (recall the glm warnings about fitted probabilities numerically 0 or 1).
2.3.3 From probabilities to confusion matrix
Conversely, say you want to be really certain that your model correctly identifies all the mines as mines. In this case, you might use a prediction threshold of 0.10, instead of 0.90.
The code pattern for cutting probabilities into predicted classes, then calculating a confusion matrix, was shown in section 2.2.2 above.
Use ifelse() to create a character vector, m_or_r, that is the positive class, “M”, when p is greater than 0.1, and the negative class, “R”, otherwise.
# If p exceeds threshold of 0.1, M else R: m_or_r
m_or_r <- ifelse(p > 0.1, "M", "R")
Convert m_or_r to be a factor, p_class, with levels the same as those of test[[“Class”]].
# Convert to factor: p_class
p_class <- factor(m_or_r, levels = levels(test[["Class"]]))
Make a confusion matrix with confusionMatrix(), passing p_class and the “Class” column from the test dataset.
# Create confusion matrix
confusionMatrix(p_class, test[["Class"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 7 31
## R 32 13
##
## Accuracy : 0.241
## 95% CI : (0.1538, 0.3473)
## No Information Rate : 0.5301
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.5258
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.17949
## Specificity : 0.29545
## Pos Pred Value : 0.18421
## Neg Pred Value : 0.28889
## Prevalence : 0.46988
## Detection Rate : 0.08434
## Detection Prevalence : 0.45783
## Balanced Accuracy : 0.23747
##
## 'Positive' Class : M
##
Awesome! A lower threshold normally yields more predicted mines, catching more of the true mines at the cost of more false alarms. As with the 0.90 threshold, the confusion matrix here is unchanged from the 0.50 threshold, likely because the predicted probabilities sit very close to 0 or 1.
2.4 Introducing the ROC curve
2.4.1 What’s the value of a ROC curve?
What is the primary value of an ROC curve?
- It has a cool acronym.
- It can be used to determine the true positive and false positive rates for a particular classification threshold.
- It evaluates all possible thresholds for splitting predicted probabilities into predicted classes.
2.4.2 Plot an ROC curve
As you saw in the video, an ROC curve is a really useful shortcut for summarizing the performance of a classifier over all possible thresholds. This saves you a lot of tedious work computing class predictions for many different thresholds and examining the confusion matrix for each.
My favorite package for computing ROC curves is caTools
, which contains a function called colAUC()
.
This function is very user-friendly and can actually calculate ROC
curves for multiple predictors at once. In this case, you only need to
calculate the ROC curve for one predictor, e.g.:
colAUC(predicted_probabilities, actual, plotROC = TRUE)
The function will return a score called AUC (more on that later) and the plotROC = TRUE
argument will return the plot of the ROC curve for visual inspection.
model, test, and train from the last exercise using the sonar data are loaded in your workspace.
Predict probabilities (i.e. type = “response”) on the test set, then store the result as p.
# Predict on test: p
p <- predict(model, test, type = "response")
library(caTools)
# Make ROC curve
colAUC(p, test[["Class"]], plotROC = TRUE)
## [,1]
## M vs. R 0.8161422
Great work! The colAUC function makes plotting an ROC curve as easy as calculating a confusion matrix.
2.5 Area under the curve (AUC)
2.5.1 Model, ROC, and AUC
What is the AUC of a perfect model?
- 0.00
- 0.50
- 1.00
2.5.2 Customizing trainControl
As you saw in the video, area under the ROC curve is a very useful, single-number summary of a model’s ability to discriminate the positive from the negative class (e.g. mines from rocks). An AUC of 0.5 is no better than random guessing, an AUC of 1.0 is a perfectly predictive model, and an AUC of 0.0 is perfectly anti-predictive (which rarely happens).
This is often a much more useful metric than simply ranking models by their accuracy at a set threshold, as different models might require different calibration steps (looking at a confusion matrix at each step) to find the optimal classification threshold for that model.
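To make those reference points concrete, here is a small illustrative sketch (not part of the course exercises) that scores a perfectly separating predictor and a random predictor with colAUC(); the labels and probabilities are made up for illustration:
# Illustrative only: AUC reference points on simulated labels
library(caTools)
set.seed(1)
actual <- factor(rep(c("M", "R"), each = 50))
perfect <- ifelse(actual == "M", 0.9, 0.1)  # probabilities that separate the classes perfectly
random <- runif(100)                        # probabilities unrelated to the labels
colAUC(perfect, actual)                     # 1.0: perfect discrimination
colAUC(random, actual)                      # near 0.5: no better than guessing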
You can use the trainControl()
function in caret
to use AUC (instead of accuracy) to tune the parameters of your models. The twoClassSummary()
convenience function allows you to do this easily.
When using twoClassSummary()
, be sure to always include the argument classProbs = TRUE
or your model will throw an error! (You cannot calculate AUC with just
class predictions. You need to have class probabilities as well.)
Customize the trainControl object to use twoClassSummary rather than defaultSummary.
Tell trainControl() to return class probabilities.
# Create trainControl object: myControl
myControl <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE
)
Great work! Don’t forget the classProbs argument to train control, especially if you’re going to calculate AUC or logloss.
2.5.3 Using custom trainControl
Now that you have a custom trainControl
object, it’s easy to fit caret
models that use AUC rather than accuracy to tune and evaluate the model. You can just pass your custom trainControl
object to the train()
function via the trControl
argument, e.g.:
train(<standard arguments here>, trControl = myControl)
This syntax gives you a convenient way to store a lot of custom modeling
parameters and then use them across multiple different calls to train()
. You will make extensive use of this trick in Chapter 5.
Use train() to predict Class from all other variables in the Sonar data (that is, Class ~ .). It should be a glm model (that is, set method to “glm”) using your custom trainControl object, myControl. Save the result to model.
# Train glm with custom trainControl: model
model <- train(
Class ~ .,
Sonar,
method = "glm",
trControl = myControl
)
## + Fold01: parameter=none
## - Fold01: parameter=none
## + Fold02: parameter=none
## - Fold02: parameter=none
## + Fold03: parameter=none
## - Fold03: parameter=none
## + Fold04: parameter=none
## - Fold04: parameter=none
## + Fold05: parameter=none
## - Fold05: parameter=none
## + Fold06: parameter=none
## - Fold06: parameter=none
## + Fold07: parameter=none
## - Fold07: parameter=none
## + Fold08: parameter=none
## - Fold08: parameter=none
## + Fold09: parameter=none
## - Fold09: parameter=none
## + Fold10: parameter=none
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
# Print model to console
model
## Generalized Linear Model
##
## 208 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 187, 187, 188, 187, 188, 188, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7255051 0.775 0.66
Great work! Note that fitting a glm with caret often produces warnings about convergence or probabilities. These warnings can almost always be safely ignored, as you can use the glm’s predictions to validate whether the model is accurate enough for your task.
3 Tuning model parameters to improve performance
In this chapter, you will use the train()
function to tweak model parameters through cross-validation and grid search.
3.1 Random forests and wine
3.1.1 Random forests vs. linear models
What’s the primary advantage of random forests over linear models?
- They make you sound cooler during job interviews.
- You can’t understand what’s going on inside of a random forest model, so you don’t have to explain it to anyone.
- A random forest is a more flexible model than a linear model, but just as easy to fit.
3.1.2 Fit a random forest
As you saw in the video, random forest models are much more flexible than linear models, and can model complicated nonlinear effects as well as automatically capture interactions between variables. They tend to give very good results on real world data, so let’s try one out on the wine quality dataset, where the goal is to predict the human-evaluated quality of a batch of wine, given some of the machine-measured chemical and physical properties of that batch.
Fitting a random forest model is exactly the same as fitting a
generalized linear regression model, as you did in the previous chapter.
You simply change the method
argument in the train
function to be “ranger”
. The ranger
package is a rewrite of R’s classic randomForest
package and fits models much faster, but gives almost exactly the same results. We suggest that all beginners use the ranger
package for random forest modeling.
Train a random forest called model on the wine quality dataset, wine, such that quality is the response variable and all other variables are explanatory variables.
Use method = “ranger”.
Use a tuneLength of 1.
wine <- readRDS("/Users/cliex159/Documents/Rstudio/DataCamp/MachineLearningwithcaretinR/datasets/wine_100.RDS")
# Fit random forest: model
model <- train(
quality ~ .,
tuneLength = 1,
data = wine,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: mtry=4, min.node.size=5, splitrule=variance
## - Fold1: mtry=4, min.node.size=5, splitrule=variance
## + Fold1: mtry=4, min.node.size=5, splitrule=extratrees
## - Fold1: mtry=4, min.node.size=5, splitrule=extratrees
## + Fold2: mtry=4, min.node.size=5, splitrule=variance
## - Fold2: mtry=4, min.node.size=5, splitrule=variance
## + Fold2: mtry=4, min.node.size=5, splitrule=extratrees
## - Fold2: mtry=4, min.node.size=5, splitrule=extratrees
## + Fold3: mtry=4, min.node.size=5, splitrule=variance
## - Fold3: mtry=4, min.node.size=5, splitrule=variance
## + Fold3: mtry=4, min.node.size=5, splitrule=extratrees
## - Fold3: mtry=4, min.node.size=5, splitrule=extratrees
## + Fold4: mtry=4, min.node.size=5, splitrule=variance
## - Fold4: mtry=4, min.node.size=5, splitrule=variance
## + Fold4: mtry=4, min.node.size=5, splitrule=extratrees
## - Fold4: mtry=4, min.node.size=5, splitrule=extratrees
## + Fold5: mtry=4, min.node.size=5, splitrule=variance
## - Fold5: mtry=4, min.node.size=5, splitrule=variance
## + Fold5: mtry=4, min.node.size=5, splitrule=extratrees
## - Fold5: mtry=4, min.node.size=5, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 4, splitrule = variance, min.node.size = 5 on full training set
Print model to the console.
# Print model to console
model
## Random Forest
##
## 100 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 81, 80, 79, 80, 80
## Resampling results across tuning parameters:
##
## splitrule RMSE Rsquared MAE
## variance 0.6423140 0.3314318 0.4940912
## extratrees 0.6785689 0.2637406 0.5106034
##
## Tuning parameter 'mtry' was held constant at a value of 4
## Tuning
## parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 4, splitrule = variance
## and min.node.size = 5.
Awesome job! Fitting a random forest is just as easy as fitting a glm. Caret makes it very easy to try out many different models.
3.2 Explore a wider model space
3.2.1 Advantage of a longer tune length
What’s the advantage of a longer tuneLength
?
- You explore more potential models and can potentially find a better model.
- Your models take less time to fit.
- There’s no advantage; you’ll always end up with the same final model.
3.2.2 Try a longer tune length
Recall from the video that random forest models have a primary tuning parameter of mtry
,
which controls how many variables are exposed to the splitting search
routine at each split. For example, suppose that a tree has a total of
10 splits and mtry = 2
. This means that a random sample of 2 predictors is drawn and evaluated at each of those 10 splits.
Use a larger tuning grid this time, but stick to the defaults provided by the train()
function. Try a tuneLength
of 3, rather than 1, to explore some more potential models, and plot the resulting model using the plot
function.
Fit a random forest model, model, using the wine dataset on the quality variable with all other variables as explanatory variables. (This will take a few seconds to run, so be patient!)
# Fit random forest: model
model <- train(
quality ~ .,
tuneLength = 3,
data = wine,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: mtry= 2, min.node.size=5, splitrule=variance
## - Fold1: mtry= 2, min.node.size=5, splitrule=variance
## + Fold1: mtry= 7, min.node.size=5, splitrule=variance
## - Fold1: mtry= 7, min.node.size=5, splitrule=variance
## + Fold1: mtry=12, min.node.size=5, splitrule=variance
## - Fold1: mtry=12, min.node.size=5, splitrule=variance
## + Fold1: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold1: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold1: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold1: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold1: mtry=12, min.node.size=5, splitrule=extratrees
## - Fold1: mtry=12, min.node.size=5, splitrule=extratrees
## + Fold2: mtry= 2, min.node.size=5, splitrule=variance
## - Fold2: mtry= 2, min.node.size=5, splitrule=variance
## + Fold2: mtry= 7, min.node.size=5, splitrule=variance
## - Fold2: mtry= 7, min.node.size=5, splitrule=variance
## + Fold2: mtry=12, min.node.size=5, splitrule=variance
## - Fold2: mtry=12, min.node.size=5, splitrule=variance
## + Fold2: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold2: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold2: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold2: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold2: mtry=12, min.node.size=5, splitrule=extratrees
## - Fold2: mtry=12, min.node.size=5, splitrule=extratrees
## + Fold3: mtry= 2, min.node.size=5, splitrule=variance
## - Fold3: mtry= 2, min.node.size=5, splitrule=variance
## + Fold3: mtry= 7, min.node.size=5, splitrule=variance
## - Fold3: mtry= 7, min.node.size=5, splitrule=variance
## + Fold3: mtry=12, min.node.size=5, splitrule=variance
## - Fold3: mtry=12, min.node.size=5, splitrule=variance
## + Fold3: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold3: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold3: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold3: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold3: mtry=12, min.node.size=5, splitrule=extratrees
## - Fold3: mtry=12, min.node.size=5, splitrule=extratrees
## + Fold4: mtry= 2, min.node.size=5, splitrule=variance
## - Fold4: mtry= 2, min.node.size=5, splitrule=variance
## + Fold4: mtry= 7, min.node.size=5, splitrule=variance
## - Fold4: mtry= 7, min.node.size=5, splitrule=variance
## + Fold4: mtry=12, min.node.size=5, splitrule=variance
## - Fold4: mtry=12, min.node.size=5, splitrule=variance
## + Fold4: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold4: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold4: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold4: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold4: mtry=12, min.node.size=5, splitrule=extratrees
## - Fold4: mtry=12, min.node.size=5, splitrule=extratrees
## + Fold5: mtry= 2, min.node.size=5, splitrule=variance
## - Fold5: mtry= 2, min.node.size=5, splitrule=variance
## + Fold5: mtry= 7, min.node.size=5, splitrule=variance
## - Fold5: mtry= 7, min.node.size=5, splitrule=variance
## + Fold5: mtry=12, min.node.size=5, splitrule=variance
## - Fold5: mtry=12, min.node.size=5, splitrule=variance
## + Fold5: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold5: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold5: mtry= 7, min.node.size=5, splitrule=extratrees
## - Fold5: mtry= 7, min.node.size=5, splitrule=extratrees
## + Fold5: mtry=12, min.node.size=5, splitrule=extratrees
## - Fold5: mtry=12, min.node.size=5, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = variance, min.node.size = 5 on full training set
Use method = “ranger”.
Set tuneLength to 3.
Print model to the console.
# Print model to console
model
## Random Forest
##
## 100 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 79, 80, 81, 80, 80
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.6493381 0.3234349 0.4966282
## 2 extratrees 0.6846140 0.2431224 0.5172347
## 7 variance 0.6246233 0.3767655 0.4770864
## 7 extratrees 0.6706236 0.2665631 0.5062392
## 12 variance 0.6264390 0.3771307 0.4856709
## 12 extratrees 0.6648015 0.2836460 0.5073460
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 7, splitrule = variance
## and min.node.size = 5.
# Plot model
plot(model)
Excellent! You can adjust the tuneLength variable to make a trade-off between runtime and how deep you want to grid-search the model.
3.3 Custom tuning grids
3.3.1 Advantages of a custom tuning grid
Why use a custom tuneGrid
?
- There’s no advantage; you’ll always end up with the same final model.
- It gives you more fine-grained control over the tuning parameters that are explored.
- It always makes your models run faster.
3.3.2 Fit a random forest with custom tuning
Now that you’ve explored the default tuning grids provided by the train()
function, let’s customize your models a bit more.
You can provide any number of values for mtry
, from 2 up to the number of columns in the dataset. In practice, there are diminishing returns for much larger values of mtry
, so you will use a custom tuning grid that explores 2 simple models (mtry = 2
and mtry = 3
) as well as one more complicated model (mtry = 7
).
Define a custom tuning grid:
- Set the number of variables to possibly split at each node, .mtry, to a vector of 2, 3, and 7.
- Set the rule to split on, .splitrule, to “variance”.
- Set the minimum node size, .min.node.size, to 5.
# From previous step
tuneGrid <- data.frame(
.mtry = c(2, 3, 7),
.splitrule = "variance",
.min.node.size = 5
)
Fit another random forest model, model, using the wine dataset on the quality variable with all other variables as explanatory variables.
Use method = “ranger”.
Use the custom tuneGrid.
# Fit random forest: model
model <- train(
quality ~ .,
tuneGrid = tuneGrid,
data = wine,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
)
## + Fold1: mtry=2, splitrule=variance, min.node.size=5
## - Fold1: mtry=2, splitrule=variance, min.node.size=5
## + Fold1: mtry=3, splitrule=variance, min.node.size=5
## - Fold1: mtry=3, splitrule=variance, min.node.size=5
## + Fold1: mtry=7, splitrule=variance, min.node.size=5
## - Fold1: mtry=7, splitrule=variance, min.node.size=5
## + Fold2: mtry=2, splitrule=variance, min.node.size=5
## - Fold2: mtry=2, splitrule=variance, min.node.size=5
## + Fold2: mtry=3, splitrule=variance, min.node.size=5
## - Fold2: mtry=3, splitrule=variance, min.node.size=5
## + Fold2: mtry=7, splitrule=variance, min.node.size=5
## - Fold2: mtry=7, splitrule=variance, min.node.size=5
## + Fold3: mtry=2, splitrule=variance, min.node.size=5
## - Fold3: mtry=2, splitrule=variance, min.node.size=5
## + Fold3: mtry=3, splitrule=variance, min.node.size=5
## - Fold3: mtry=3, splitrule=variance, min.node.size=5
## + Fold3: mtry=7, splitrule=variance, min.node.size=5
## - Fold3: mtry=7, splitrule=variance, min.node.size=5
## + Fold4: mtry=2, splitrule=variance, min.node.size=5
## - Fold4: mtry=2, splitrule=variance, min.node.size=5
## + Fold4: mtry=3, splitrule=variance, min.node.size=5
## - Fold4: mtry=3, splitrule=variance, min.node.size=5
## + Fold4: mtry=7, splitrule=variance, min.node.size=5
## - Fold4: mtry=7, splitrule=variance, min.node.size=5
## + Fold5: mtry=2, splitrule=variance, min.node.size=5
## - Fold5: mtry=2, splitrule=variance, min.node.size=5
## + Fold5: mtry=3, splitrule=variance, min.node.size=5
## - Fold5: mtry=3, splitrule=variance, min.node.size=5
## + Fold5: mtry=7, splitrule=variance, min.node.size=5
## - Fold5: mtry=7, splitrule=variance, min.node.size=5
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = variance, min.node.size = 5 on full training set
Print model to the console.
# Print model to console
model
## Random Forest
##
## 100 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 79, 80, 81, 80
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6709358 0.3161420 0.5148394
## 3 0.6662775 0.3087695 0.5154802
## 7 0.6483430 0.3447539 0.4921923
##
## Tuning parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 7, splitrule = variance
## and min.node.size = 5.
Plot the model after fitting it, using plot().
# Plot model
plot(model)
Great work! Model tuning plots can be very useful for understanding caret models.
3.4 Introducing glmnet
3.4.1 Advantage of glmnet
What’s the advantage of glmnet
over regular glm
models?
- glmnet models automatically find interaction variables.
- glmnet models don’t provide p-values or confidence intervals on predictions.
- glmnet models place constraints on your coefficients, which helps prevent overfitting.
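To see that last point in action, here is an illustrative sketch (using the glmnet package directly rather than caret, on made-up data) showing coefficients shrinking toward zero as the penalty grows:
# Illustrative only: lasso coefficients shrink as the penalty lambda grows
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)  # 100 rows, 10 made-up predictors
y <- rnorm(100)
fit <- glmnet(x, y, alpha = 1)           # alpha = 1 is the lasso penalty
coef(fit, s = 0.01)                      # small penalty: many small, nonzero coefficients
coef(fit, s = 0.5)                       # large penalty: most coefficients exactly zero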
3.4.2 Make a custom trainControl
The wine quality dataset was a regression problem, but now you are looking at a classification problem. This is a simulated dataset based on the “don’t overfit” competition on Kaggle a number of years ago.
Classification problems are a little more complicated than regression problems because you have to provide a custom summaryFunction
to the train()
function to use the AUC
metric to rank your models. Start by making a custom trainControl
, as you did in the previous chapter. Be sure to set classProbs = TRUE
, otherwise the twoClassSummary
for summaryFunction
will break.
Make a custom trainControl called myControl for classification using the trainControl() function.
- Use 10 CV folds.
- Use twoClassSummary for the summaryFunction.
- Be sure to set classProbs = TRUE.
# Create custom trainControl: myControl
myControl <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE
)
Great work! Creating a custom trainControl gives you much finer control over how caret searches for models.
3.4.3 Fit glmnet with custom trainControl
Now that you have a custom trainControl
object, fit a glmnet
model to the “don’t overfit” dataset. Recall from the video that glmnet
is an extension of the generalized linear regression model (or glm
)
that places constraints on the magnitude of the coefficients to prevent
overfitting. This is more commonly known as “penalized” regression
modeling and is a very useful technique on datasets with many predictors
and few values.
glmnet
is capable of fitting two different kinds of penalized models, controlled by the alpha
parameter:
- Ridge regression (or alpha = 0)
- Lasso regression (or alpha = 1)
You’ll now fit a glmnet
model to the “don’t overfit” dataset using the defaults provided by the caret
package.
Train a glmnet model called model on the overfit data. Use the custom trainControl from the previous exercise (myControl). The variable y is the response variable and all other variables are explanatory variables.
overfit <- read.csv("https://assets.datacamp.com/production/repositories/223/datasets/0bd5f7c30d9aec3e1f1fa677a19bee3af407453a/overfit.csv")
# Fit glmnet model: model
model <- train(
y ~ .,
overfit,
method = "glmnet",
trControl = myControl
)
## + Fold01: alpha=0.10, lambda=0.01013
## - Fold01: alpha=0.10, lambda=0.01013
## + Fold01: alpha=0.55, lambda=0.01013
## - Fold01: alpha=0.55, lambda=0.01013
## + Fold01: alpha=1.00, lambda=0.01013
## - Fold01: alpha=1.00, lambda=0.01013
## + Fold02: alpha=0.10, lambda=0.01013
## - Fold02: alpha=0.10, lambda=0.01013
## + Fold02: alpha=0.55, lambda=0.01013
## - Fold02: alpha=0.55, lambda=0.01013
## + Fold02: alpha=1.00, lambda=0.01013
## - Fold02: alpha=1.00, lambda=0.01013
## + Fold03: alpha=0.10, lambda=0.01013
## - Fold03: alpha=0.10, lambda=0.01013
## + Fold03: alpha=0.55, lambda=0.01013
## - Fold03: alpha=0.55, lambda=0.01013
## + Fold03: alpha=1.00, lambda=0.01013
## - Fold03: alpha=1.00, lambda=0.01013
## + Fold04: alpha=0.10, lambda=0.01013
## - Fold04: alpha=0.10, lambda=0.01013
## + Fold04: alpha=0.55, lambda=0.01013
## - Fold04: alpha=0.55, lambda=0.01013
## + Fold04: alpha=1.00, lambda=0.01013
## - Fold04: alpha=1.00, lambda=0.01013
## + Fold05: alpha=0.10, lambda=0.01013
## - Fold05: alpha=0.10, lambda=0.01013
## + Fold05: alpha=0.55, lambda=0.01013
## - Fold05: alpha=0.55, lambda=0.01013
## + Fold05: alpha=1.00, lambda=0.01013
## - Fold05: alpha=1.00, lambda=0.01013
## + Fold06: alpha=0.10, lambda=0.01013
## - Fold06: alpha=0.10, lambda=0.01013
## + Fold06: alpha=0.55, lambda=0.01013
## - Fold06: alpha=0.55, lambda=0.01013
## + Fold06: alpha=1.00, lambda=0.01013
## - Fold06: alpha=1.00, lambda=0.01013
## + Fold07: alpha=0.10, lambda=0.01013
## - Fold07: alpha=0.10, lambda=0.01013
## + Fold07: alpha=0.55, lambda=0.01013
## - Fold07: alpha=0.55, lambda=0.01013
## + Fold07: alpha=1.00, lambda=0.01013
## - Fold07: alpha=1.00, lambda=0.01013
## + Fold08: alpha=0.10, lambda=0.01013
## - Fold08: alpha=0.10, lambda=0.01013
## + Fold08: alpha=0.55, lambda=0.01013
## - Fold08: alpha=0.55, lambda=0.01013
## + Fold08: alpha=1.00, lambda=0.01013
## - Fold08: alpha=1.00, lambda=0.01013
## + Fold09: alpha=0.10, lambda=0.01013
## - Fold09: alpha=0.10, lambda=0.01013
## + Fold09: alpha=0.55, lambda=0.01013
## - Fold09: alpha=0.55, lambda=0.01013
## + Fold09: alpha=1.00, lambda=0.01013
## - Fold09: alpha=1.00, lambda=0.01013
## + Fold10: alpha=0.10, lambda=0.01013
## - Fold10: alpha=0.10, lambda=0.01013
## + Fold10: alpha=0.55, lambda=0.01013
## - Fold10: alpha=0.55, lambda=0.01013
## + Fold10: alpha=1.00, lambda=0.01013
## - Fold10: alpha=1.00, lambda=0.01013
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 0.0101 on full training set
# Print model to console
model
## glmnet
##
## 250 samples
## 200 predictors
## 2 classes: 'class1', 'class2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 225, 225, 225, 224, 225, 225, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0.10 0.0001012745 0.4564312 0.1 0.9617754
## 0.10 0.0010127448 0.4524457 0.0 0.9786232
## 0.10 0.0101274483 0.4677536 0.0 0.9916667
## 0.55 0.0001012745 0.4137681 0.1 0.9615942
## 0.55 0.0010127448 0.4310688 0.1 0.9574275
## 0.55 0.0101274483 0.4398551 0.0 0.9789855
## 1.00 0.0001012745 0.4009058 0.1 0.9273551
## 1.00 0.0010127448 0.3989130 0.1 0.9360507
## 1.00 0.0101274483 0.4476449 0.1 0.9748188
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.01012745.
Use the max() function to find the maximum of the ROC statistic contained somewhere in model[["results"]].
# Print maximum ROC statistic
max(model[["results"]][["ROC"]])
## [1] 0.4677536
Awesome job! This glmnet model will use AUC rather than accuracy to select the final model parameters.
3.5 glmnet with custom tuning grid
3.5.1 Why a custom tuning grid?
Why use a custom tuning grid for a glmnet
model?
-
There’s no reason to use a custom grid; the default is always the best.
-
The default tuning grid is very small and there are many more potential
glmnet
models you want to explore. -
glmnet
models are really slow, so you should never try more than a few tuning parameters.
3.5.2 glmnet with custom trainControl and tuning
As you saw in the video, the glmnet
model actually fits
many models at once (one of the great things about the package). You can
exploit this by passing a large number of lambda
values, which control the amount of penalization in the model. train()
is smart enough to only fit one model per alpha
value and pass all of the lambda
values at once for simultaneous fitting.
My favorite tuning grid for glmnet
models is:
expand.grid(
alpha = 0:1,
lambda = seq(0.0001, 1, length = 100)
)
This grid explores a large number of lambda
values (100, in fact), from a very small one to a very large one. (You could increase the maximum lambda
to 10, but in this exercise 1 is a good upper bound.)
If you want to explore fewer models, you can use a shorter lambda sequence. For example, lambda = seq(0.0001, 1, length = 10)
would fit 10 models per value of alpha.
You also look at the two forms of penalized models with this tuneGrid
: ridge regression and lasso regression. alpha = 0
is pure ridge regression, and alpha = 1
is pure lasso regression. You can fit a mixture of the two models (i.e. an elastic net) using an alpha
between 0 and 1. For example, alpha = 0.05
would be 95% ridge regression and 5% lasso regression.
In this problem you’ll just explore the 2 extremes – pure ridge and pure lasso regression – for the purpose of illustrating their differences.
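For illustration only, here is a hedged sketch of a grid that would also cover intermediate alpha values and therefore true elastic net mixtures (the name enet_grid is made up and is not used in the exercise below):
# Hypothetical grid mixing ridge (alpha = 0), lasso (alpha = 1) and elastic nets in between
enet_grid <- expand.grid(
  alpha = c(0, 0.25, 0.5, 0.75, 1),
  lambda = seq(0.0001, 1, length = 20)
)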
Train a glmnet model on the overfit data such that y is the response variable and all other variables are explanatory variables. Make sure to use your custom trainControl from the previous exercise (myControl). Also, use a custom tuneGrid to explore alpha = 0:1 and 20 values of lambda between 0.0001 and 1 per value of alpha.
# Train glmnet with custom trainControl and tuning: model
model <- train(
  y ~ .,
  overfit,
  tuneGrid = expand.grid(
    alpha = 0:1,
    lambda = seq(0.0001, 1, length = 20)
  ),
  method = "glmnet",
  trControl = myControl
)
## + Fold01: alpha=0, lambda=1
## - Fold01: alpha=0, lambda=1
## + Fold01: alpha=1, lambda=1
## - Fold01: alpha=1, lambda=1
## + Fold02: alpha=0, lambda=1
## - Fold02: alpha=0, lambda=1
## + Fold02: alpha=1, lambda=1
## - Fold02: alpha=1, lambda=1
## + Fold03: alpha=0, lambda=1
## - Fold03: alpha=0, lambda=1
## + Fold03: alpha=1, lambda=1
## - Fold03: alpha=1, lambda=1
## + Fold04: alpha=0, lambda=1
## - Fold04: alpha=0, lambda=1
## + Fold04: alpha=1, lambda=1
## - Fold04: alpha=1, lambda=1
## + Fold05: alpha=0, lambda=1
## - Fold05: alpha=0, lambda=1
## + Fold05: alpha=1, lambda=1
## - Fold05: alpha=1, lambda=1
## + Fold06: alpha=0, lambda=1
## - Fold06: alpha=0, lambda=1
## + Fold06: alpha=1, lambda=1
## - Fold06: alpha=1, lambda=1
## + Fold07: alpha=0, lambda=1
## - Fold07: alpha=0, lambda=1
## + Fold07: alpha=1, lambda=1
## - Fold07: alpha=1, lambda=1
## + Fold08: alpha=0, lambda=1
## - Fold08: alpha=0, lambda=1
## + Fold08: alpha=1, lambda=1
## - Fold08: alpha=1, lambda=1
## + Fold09: alpha=0, lambda=1
## - Fold09: alpha=0, lambda=1
## + Fold09: alpha=1, lambda=1
## - Fold09: alpha=1, lambda=1
## + Fold10: alpha=0, lambda=1
## - Fold10: alpha=0, lambda=1
## + Fold10: alpha=1, lambda=1
## - Fold10: alpha=1, lambda=1
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.0527 on full training set
Print model to the console.
# Print model to console
model
## glmnet
##
## 250 samples
## 200 predictors
## 2 classes: 'class1', 'class2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 225, 224, 226, 224, 225, 225, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0 0.00010000 0.4558877 0.0 0.9742754
## 0 0.05272632 0.4473732 0.0 0.9958333
## 0 0.10535263 0.4538949 0.0 1.0000000
## 0 0.15797895 0.4667572 0.0 1.0000000
## 0 0.21060526 0.4753623 0.0 1.0000000
## 0 0.26323158 0.4797101 0.0 1.0000000
## 0 0.31585789 0.4797101 0.0 1.0000000
## 0 0.36848421 0.4797101 0.0 1.0000000
## 0 0.42111053 0.4860507 0.0 1.0000000
## 0 0.47373684 0.4795290 0.0 1.0000000
## 0 0.52636316 0.4837862 0.0 1.0000000
## 0 0.57898947 0.4837862 0.0 1.0000000
## 0 0.63161579 0.4859601 0.0 1.0000000
## 0 0.68424211 0.4859601 0.0 1.0000000
## 0 0.73686842 0.4881341 0.0 1.0000000
## 0 0.78949474 0.4837862 0.0 1.0000000
## 0 0.84212105 0.4816123 0.0 1.0000000
## 0 0.89474737 0.4857790 0.0 1.0000000
## 0 0.94737368 0.4836051 0.0 1.0000000
## 0 1.00000000 0.4836051 0.0 1.0000000
## 1 0.00010000 0.4336051 0.1 0.9443841
## 1 0.05272632 0.5086957 0.0 1.0000000
## 1 0.10535263 0.5000000 0.0 1.0000000
## 1 0.15797895 0.5000000 0.0 1.0000000
## 1 0.21060526 0.5000000 0.0 1.0000000
## 1 0.26323158 0.5000000 0.0 1.0000000
## 1 0.31585789 0.5000000 0.0 1.0000000
## 1 0.36848421 0.5000000 0.0 1.0000000
## 1 0.42111053 0.5000000 0.0 1.0000000
## 1 0.47373684 0.5000000 0.0 1.0000000
## 1 0.52636316 0.5000000 0.0 1.0000000
## 1 0.57898947 0.5000000 0.0 1.0000000
## 1 0.63161579 0.5000000 0.0 1.0000000
## 1 0.68424211 0.5000000 0.0 1.0000000
## 1 0.73686842 0.5000000 0.0 1.0000000
## 1 0.78949474 0.5000000 0.0 1.0000000
## 1 0.84212105 0.5000000 0.0 1.0000000
## 1 0.89474737 0.5000000 0.0 1.0000000
## 1 0.94737368 0.5000000 0.0 1.0000000
## 1 1.00000000 0.5000000 0.0 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.05272632.
Print the max() of the ROC statistic in model[["results"]]. You can access it using model[["results"]][["ROC"]].
# Print maximum ROC statistic
max(model[["results"]][["ROC"]])
## [1] 0.5086957
Excellent work! I use this custom tuning grid for all my glmnet models – it’s a great place to start!
3.5.3 Interpreting glmnet plots
Here’s the tuning plot for the custom tuned glmnet
model you created in the last exercise. For the overfit
dataset, which value of alpha
is better?
-
alpha = 0
(ridge) -
alpha = 1
(lasso)
Correct! For this dataset, alpha = 1
(or lasso) is better.
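The tuning plot itself is not reproduced here. Assuming the model object from the previous exercise is still in your workspace, a minimal sketch to regenerate it is:
# Plot cross-validated ROC against lambda, one curve per alpha value
plot(model)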
4 Preprocessing your data
In this chapter, you will practice using train()
to preprocess data before fitting models, improving your ability to make accurate predictions.
4.1 Median imputation
4.1.1 Median imputation vs. omitting rows
What’s the value of median imputation?
-
It removes some variance from your data, making it easier to model.
-
It lets you model data with missing values.
-
It’s useless; you should just throw out rows of data with any missing values.
4.1.2 Apply median imputation
In this chapter, you’ll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.
This dataset is interesting because many of the predictors contain
missing values and most rows of the dataset have at least one missing
value. This presents a modeling challenge, because most machine learning
algorithms cannot handle missing values out of the box. For example,
your first instinct might be to fit a logistic regression model to this
data, but prior to doing this you need a strategy for handling the NA
s.
Fortunately, the train()
function in caret
contains an argument called preProcess
,
which allows you to specify that median imputation should be used to
fill in the missing values. In previous chapters, you created models
with the train()
function using formulas such as y ~ .
. An alternative way is to specify the x
and y
arguments to train()
, where x
is an object with samples in rows and features in columns and y
is a numeric or factor vector containing the outcomes. Said differently, x
is a matrix or data frame that contains the whole dataset you’d use for the data
argument to the lm()
call, for example, but excludes the response variable column; y
is a vector that contains just the response variable column.
For this exercise, the argument x
to train()
is loaded in your workspace as breast_cancer_x
and y
as breast_cancer_y
.
Use the train() function to fit a glm model called median_model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values.
load("/Users/cliex159/Documents/Rstudio/DataCamp/MachineLearningwithcaretinR/datasets/BreastCancer.RData")
# Apply median imputation: median_model
<- train(
median_model x = breast_cancer_x,
y = breast_cancer_y,
method = "glm",
trControl = myControl,
preProcess = "medianImpute"
)
## + Fold01: parameter=none
## - Fold01: parameter=none
## + Fold02: parameter=none
## - Fold02: parameter=none
## + Fold03: parameter=none
## - Fold03: parameter=none
## + Fold04: parameter=none
## - Fold04: parameter=none
## + Fold05: parameter=none
## - Fold05: parameter=none
## + Fold06: parameter=none
## - Fold06: parameter=none
## + Fold07: parameter=none
## - Fold07: parameter=none
## + Fold08: parameter=none
## - Fold08: parameter=none
## + Fold09: parameter=none
## - Fold09: parameter=none
## + Fold10: parameter=none
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
Print median_model to the console.
# Print median_model to console
median_model
## Generalized Linear Model
##
## 699 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## Pre-processing: median imputation (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 629, 629, 629, 628, 629, 630, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9913969 0.9694203 0.9461667
Fantastic job! Caret makes it very easy to include model preprocessing in your model validation.
4.2 KNN imputation
4.2.1 Comparing KNN imputation to median imputation
Will KNN imputation always be better than median imputation?
-
No, you should try both options and keep the one that gives more accurate models.
-
Yes, KNN is a more complicated model than medians, so it’s always better.
-
No, medians are more statistically valid than KNN and should always be used.
4.2.2 Use KNN imputation
In the previous exercise, you used median imputation to fill in missing values in the breast cancer dataset, but that is not the only possible method for dealing with missing data.
An alternative to median imputation is k-nearest neighbors, or KNN,
imputation. This is a more advanced form of imputation where missing
values are replaced with values from other rows that are similar to the
current row. While this is a lot more complicated to implement in
practice than simple median imputation, it is very easy to explore in caret
using the preProcess
argument to train()
. You can simply use preProcess = "knnImpute"
to change the method of imputation used prior to model fitting.
breast_cancer_x
and breast_cancer_y
are loaded in your workspace.
Use the train() function to fit a glm model called knn_model to the breast cancer dataset.
library(RANN)

# Apply KNN imputation: knn_model
knn_model <- train(
  x = breast_cancer_x,
  y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = "knnImpute"
)
## + Fold01: parameter=none
## - Fold01: parameter=none
## + Fold02: parameter=none
## - Fold02: parameter=none
## + Fold03: parameter=none
## - Fold03: parameter=none
## + Fold04: parameter=none
## - Fold04: parameter=none
## + Fold05: parameter=none
## - Fold05: parameter=none
## + Fold06: parameter=none
## - Fold06: parameter=none
## + Fold07: parameter=none
## - Fold07: parameter=none
## + Fold08: parameter=none
## - Fold08: parameter=none
## + Fold09: parameter=none
## - Fold09: parameter=none
## + Fold10: parameter=none
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
# Print knn_model to console
knn_model
## Generalized Linear Model
##
## 699 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## Pre-processing: nearest neighbor imputation (9), centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 630, 629, 629, 629, 630, 628, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9909863 0.9715459 0.9376667
Good work! As you can see, you can easily try out different imputation methods.
4.2.3 Compare KNN and median imputation
All of the preprocessing steps in the train()
function
happen in the training set of each cross-validation fold, so the error
metrics reported include the effects of the preprocessing.
This includes the imputation method used (e.g. knnImpute
or medianImpute
).
This is useful because it allows you to compare different methods of
imputation and choose the one that performs the best out-of-sample.
median_model
and knn_model
are available in your workspace, as is resamples
, which contains the resampled results of both models. Look at the results of the models by calling
dotplot(resamples, metric = "ROC")
and choose the one that performs the best out-of-sample. Which method of
imputation yields the highest out-of-sample ROC score for your glm
model?
-
KNN imputation is much better than median imputation.
-
KNN imputation is slightly better than median imputation.
-
Median imputation is much better than KNN imputation.
-
Median imputation is slightly better than KNN imputation.
Nice!
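The code that builds the resamples object is not shown above; a minimal sketch, assuming median_model and knn_model were fit with comparable resampling schemes, uses caret's resamples() and the lattice dotplot() method:
# Collect the cross-validated results of both models
resamples <- resamples(list(median = median_model, knn = knn_model))

# Compare out-of-sample ROC for the two imputation strategies
dotplot(resamples, metric = "ROC")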
4.3 Multiple preprocessing methods
4.3.1 Order of operations
Which comes first in caret
’s preProcess()
function: median imputation or centering and scaling of variables?
-
Median imputation comes before centering and scaling.
-
Centering and scaling come before median imputation.
4.3.2 Combining preprocessing methods
The preProcess
argument to train()
doesn’t just limit you to imputing missing values. It also includes a wide variety of other preProcess
techniques to make your life as a data scientist much easier. You can read a full list of them by typing ?preProcess
and reading the help page for this function.
One set of preprocessing functions that is particularly useful for fitting regression models is standardization: centering and scaling. You first center by subtracting the mean of each column from each value in that column, then you scale by dividing by the standard deviation.
Standardization transforms your data such that for each column, the mean is 0 and the standard deviation is 1. This makes it easier for regression models to find a good solution.
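As a quick numeric check (a sketch with made-up numbers, not course data), centering and scaling by hand agrees with base R's scale():
# Center by subtracting the mean, then scale by dividing by the standard deviation
x <- c(2, 4, 6, 8)
x_std <- (x - mean(x)) / sd(x)

# scale() performs the same transformation
all.equal(as.numeric(scale(x)), x_std)
## [1] TRUE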
breast_cancer_x
and breast_cancer_y
are loaded in your workspace. Fit a logistic regression model using median imputation called model
to the breast cancer data, then print it to the console.
# Fit glm with median imputation
model <- train(
  x = breast_cancer_x,
  y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = "medianImpute"
)
## + Fold01: parameter=none
## - Fold01: parameter=none
## + Fold02: parameter=none
## - Fold02: parameter=none
## + Fold03: parameter=none
## - Fold03: parameter=none
## + Fold04: parameter=none
## - Fold04: parameter=none
## + Fold05: parameter=none
## - Fold05: parameter=none
## + Fold06: parameter=none
## - Fold06: parameter=none
## + Fold07: parameter=none
## - Fold07: parameter=none
## + Fold08: parameter=none
## - Fold08: parameter=none
## + Fold09: parameter=none
## - Fold09: parameter=none
## + Fold10: parameter=none
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
# Print model
model
## Generalized Linear Model
##
## 699 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## Pre-processing: median imputation (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 629, 629, 629, 628, 629, 630, ...
## Resampling results:
##
## ROC Sens Spec
## 0.992494 0.9695169 0.9458333
Update the model to include two more pre-processing steps: centering and scaling.
# Update model with standardization
model <- train(
  x = breast_cancer_x,
  y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = c("medianImpute", "center", "scale")
)
## + Fold01: parameter=none
## - Fold01: parameter=none
## + Fold02: parameter=none
## - Fold02: parameter=none
## + Fold03: parameter=none
## - Fold03: parameter=none
## + Fold04: parameter=none
## - Fold04: parameter=none
## + Fold05: parameter=none
## - Fold05: parameter=none
## + Fold06: parameter=none
## - Fold06: parameter=none
## + Fold07: parameter=none
## - Fold07: parameter=none
## + Fold08: parameter=none
## - Fold08: parameter=none
## + Fold09: parameter=none
## - Fold09: parameter=none
## + Fold10: parameter=none
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
# Print updated model
model
## Generalized Linear Model
##
## 699 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## Pre-processing: median imputation (9), centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 629, 628, 629, 629, 629, 630, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9910105 0.9716425 0.9291667
Great work! You can combine many different preprocessing methods with caret.
4.4 Handling low-information predictors
4.4.1 Why remove near zero variance predictors?
What’s the best reason to remove near zero variance predictors from your data before building a model?
-
Because they are guaranteed to have no effect on your model.
-
Because their p-values in a linear regression will always be low.
-
To reduce model-fitting time without reducing model accuracy.
4.4.2 Remove near zero variance predictors
As you saw in the video, for the next set of exercises, you’ll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:
log((concentration of compound in brain) /
(concentration of compound in blood))
This gives a quantitative metric of the compound’s ability to cross the blood-brain barrier, and is useful for understanding the biological properties of that barrier.
One interesting aspect of this dataset is that it contains many variables and many of these variables have extremely low variances. This means that there is very little information in these variables because they mostly consist of a single value (e.g. zero).
Fortunately, caret
contains a utility function called nearZeroVar()
for removing such variables to save time during modeling.
nearZeroVar()
takes in data x
, then looks at the ratio of the most common value to the second most common value, freqCut
, and the percentage of distinct values out of the number of total samples, uniqueCut
. By default, caret
uses freqCut = 19
and uniqueCut = 10
, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2
and uniqueCut = 20
when calling nearZeroVar()
.
bloodbrain_x
and bloodbrain_y
are loaded in your workspace.
Run nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar().
load("/Users/cliex159/Documents/Rstudio/DataCamp/MachineLearningwithcaretinR/datasets/BloodBrain.Rdata")
# Identify near zero variance predictors: remove_cols
<- nearZeroVar(bloodbrain_x, names = TRUE,
remove_cols freqCut = 2, uniqueCut = 20)
Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols.
# Get all column names from bloodbrain_x: all_cols
all_cols <- names(bloodbrain_x)
Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e. that you don’t want to remove).
# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[, setdiff(all_cols, remove_cols)]
Great work! Near zero variance variables can cause issues during cross-validation.
4.4.3 preProcess() and nearZeroVar()
Can you use the preProcess
argument in caret
to remove near-zero variance predictors? Or do you have to do this by hand, prior to modeling, using the nearZeroVar()
function?
-
Yes! Set the
preProcess
argument equal to "nzv"
. -
No, unfortunately. You have to do this by hand.
Yes!
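A minimal sketch of that option, assuming bloodbrain_x and bloodbrain_y are still loaded (the name model_nzv is made up for illustration):
# Drop near zero variance predictors inside train() via preProcess
model_nzv <- train(
  x = bloodbrain_x,
  y = bloodbrain_y,
  method = "glm",
  preProcess = "nzv"
)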
4.4.4 Fit model on reduced blood-brain data
Now that you’ve reduced your dataset, you can fit a glm
model to it using the train()
function. This model will run faster than using the full dataset and will yield very similar predictive accuracy.
Furthermore, zero variance variables can cause problems with cross-validation (e.g. if one fold ends up with only a single unique value for that variable), so removing them prior to modeling means you are less likely to get errors during the fitting process.
bloodbrain_x, bloodbrain_y, remove_cols, and bloodbrain_x_small are loaded in your workspace.
Fit a glm model using the train() function and the reduced blood-brain dataset you created in the previous exercise.
# Fit model on reduced data: model
model <- train(
  x = bloodbrain_x_small,
  y = bloodbrain_y,
  method = "glm"
)
# Print model to console
model
## Generalized Linear Model
##
## 208 samples
## 112 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 208, 208, 208, 208, 208, 208, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1082.142 0.1091276 257.9763
Excellent job! As discussed previously, glm generates a lot of warnings about convergence, but they’re never a big deal and you can use the out-of-sample accuracy to make sure your model makes good predictions.
4.5 Principal components analysis (PCA)
4.5.1 Using PCA as an alternative to nearZeroVar()
An alternative to removing low-variance predictors is to run PCA on your dataset. This is sometimes preferable because it does not throw out all of your data: many different low variance predictors may end up combined into one high variance PCA variable, which might have a positive impact on your model’s accuracy.
This is an especially good trick for linear models: the pca
option in the preProcess
argument will center and scale your data, combine low variance
variables, and ensure that all of your predictors are orthogonal. This
creates an ideal dataset for linear regression modeling, and can often
improve the accuracy of your models.
bloodbrain_x
and bloodbrain_y
are loaded in your workspace.
Fit a glm model to the full blood-brain dataset using the "pca" option to preProcess.
# Fit glm model using PCA: model
model <- train(
  x = bloodbrain_x,
  y = bloodbrain_y,
  method = "glm",
  preProcess = "pca"
)
# Print model to console
model
## Generalized Linear Model
##
## 208 samples
## 132 predictors
##
## Pre-processing: principal component signal extraction (132), centered
## (132), scaled (132)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 208, 208, 208, 208, 208, 208, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6095159 0.4174699 0.4614212
Great work! Note that the PCA model’s accuracy is slightly higher than the nearZeroVar()
model from the previous exercise. PCA is generally a better method for
handling low-information predictors than throwing them out entirely.
5 Selecting models: a case study in churn prediction
In the final chapter of this course, you’ll learn how to use resamples()
to compare multiple models and select (or ensemble) the best one(s).
5.1 Reusing a trainControl
5.1.1 Why reuse a trainControl?
Why reuse a trainControl
?
-
So you can use the same
summaryFunction
and tuning parameters for multiple models. -
So you don’t have to repeat code when fitting multiple models.
-
So you can compare models on the exact same training and test data.
-
All of the above.
5.1.2 Make custom train/test indices
As you saw in the video, for this chapter you will focus on a real-world dataset that brings together all of the concepts discussed in the previous chapters.
The churn dataset contains data on a variety of telecom customers and the modeling challenge is to predict which customers will cancel their service (or churn).
In this chapter, you will be exploring two different types of predictive models: glmnet
and rf
, so the first order of business is to create a reusable trainControl
object you can use to reliably compare them.
churn_x
and churn_y
are loaded in your workspace.
Use createFolds() to create 5 CV folds on churn_y, your target variable for this exercise.
load("/Users/cliex159/Documents/Rstudio/DataCamp/MachineLearningwithcaretinR/datasets/Churn.RData")
# Create custom indices: myFolds
<- createFolds(churn_y, k = 5)
myFolds
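These folds become reusable once you pass them to trainControl() through its index argument. A hedged sketch of what that looks like (the summaryFunction and classProbs settings are assumptions carried over from the earlier classification setup):
# Reuse the exact same folds for every model you want to compare
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  index = myFolds
)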
churn_x
## stateAK stateAL stateAR stateAZ stateCA stateCO stateCT stateDC stateDE
## 4575 0 0 0 0 0 0 0 0 0
## 4685 0 0 0 0 0 0 0 0 0
## 1431 0 0 0 0 0 0 0 0 0
## 4150 0 0 0 0 0 0 0 0 1
## 3207 0 0 0 0 0 0 0 0 0
## 2593 0 0 0 0 0 0 0 0 0
## 3679 0 0 0 0 0 0 0 0 0
## 673 0 0 0 0 0 0 0 0 0
## 3280 0 0 0 0 0 0 0 0 0
## 3519 0 0 0 0 0 0 0 1 0
## 2285 0 0 0 0 0 0 0 0 0
## 3588 0 0 0 0 0 0 0 0 0
## 4663 0 0 0 0 0 0 0 0 0
## 1274 0 0 0 0 0 0 0 0 0
## 2305 0 0 0 0 0 0 0 0 0
## 4686 0 0 0 0 0 0 0 0 0
## 4876 0 0 0 0 0 0 0 0 0
## 586 0 0 1 0 0 0 0 0 0
## 2367 0 0 0 0 0 0 0 0 0
## 2792 0 0 0 0 0 0 0 0 0
## 4503 0 0 0 0 0 0 1 0 0
## 691 0 0 0 0 0 0 0 0 0
## 4923 0 0 0 0 0 0 0 0 0
## 4712 0 0 0 0 0 0 0 0 0
## 411 0 0 0 0 0 0 0 0 0
## 2559 0 0 1 0 0 0 0 0 0
## 1941 0 0 0 0 0 0 0 0 0
## 4505 0 1 0 0 0 0 0 0 0
## 2223 1 0 0 0 0 0 0 0 0
## 4156 0 0 0 0 0 0 0 0 0
## 3666 0 0 0 0 0 0 0 0 0
## 4031 0 0 0 0 0 0 0 0 0
## 1929 0 0 0 0 0 0 0 0 0
## 3404 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0
## 4136 0 0 0 0 0 0 0 0 0
## 37 1 0 0 0 0 0 0 0 0
## 1031 0 0 0 0 0 0 0 0 0
## 4499 0 0 0 0 0 0 0 0 0
## 3036 0 0 0 0 0 0 0 0 0
## 1883 0 0 0 0 0 0 0 0 0
## 2161 0 0 0 0 0 0 0 0 0
## 186 0 0 0 0 0 0 0 0 0
## 4826 0 0 0 0 0 0 0 0 0
## 2140 0 0 0 0 0 0 0 0 0
## 4745 0 0 0 0 0 0 0 0 0
## 4398 0 0 0 0 0 0 0 0 0
## 3170 0 0 0 0 0 0 0 0 0
## 4809 0 0 0 0 0 0 0 0 0
## 3064 0 0 0 0 0 0 1 0 0
## 1651 0 0 0 0 0 0 0 0 0
## 1717 0 0 0 0 0 0 0 0 0
## 1972 0 0 0 0 0 0 0 0 0
## 3882 0 0 0 0 0 0 0 0 0
## 193 0 0 0 0 0 0 0 0 0
## 3703 0 0 0 0 0 0 0 0 0
## 3349 0 0 0 0 0 0 0 0 0
## 847 0 0 0 0 0 0 0 0 0
## 1291 0 0 0 0 1 0 0 0 0
## 2542 0 0 0 0 0 0 0 0 0
## 3338 0 0 0 0 0 0 0 0 0
## 4855 0 0 0 0 0 0 0 0 0
## 3751 0 0 0 0 0 0 0 0 0
## 2797 0 0 0 0 0 0 0 0 0
## 4195 0 0 0 0 0 0 0 0 0
## 936 0 0 0 0 0 0 0 0 0
## 1339 0 0 0 0 0 0 0 0 0
## 4086 0 0 0 0 0 0 0 0 0
## 3419 0 0 0 0 0 0 0 0 0
## 1187 0 0 0 0 0 0 0 0 0
## 212 0 0 0 1 0 0 0 0 0
## 693 0 0 0 0 0 0 0 0 0
## 1067 0 0 0 0 0 0 0 0 0
## 2362 0 0 0 0 0 0 0 0 0
## 973 0 0 0 0 0 0 0 0 0
## 3543 0 0 0 0 0 0 0 0 0
## 39 1 0 0 0 0 0 0 0 0
## 1849 0 0 0 0 0 0 0 0 0
## 2532 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 2862 0 0 0 0 1 0 0 0 0
## 777 0 1 0 0 0 0 0 0 0
## 1766 0 0 0 0 0 0 0 0 0
## 3175 0 0 0 0 0 0 0 0 0
## 3814 0 0 0 0 0 0 0 0 0
## 2771 0 0 0 0 0 0 0 0 0
## 1149 0 0 0 0 0 0 0 0 0
## 443 0 0 1 0 0 0 0 0 0
## 421 0 0 0 0 0 0 0 0 0
## 1499 0 0 0 0 0 0 0 0 0
## 3278 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 1024 0 0 1 0 0 0 0 0 0
## 4579 0 0 1 0 0 0 0 0 0
## 4542 0 0 0 0 0 0 0 0 0
## 3601 0 0 0 0 0 0 0 0 0
## 1634 0 0 0 0 0 0 0 0 0
## 2526 0 0 0 0 0 0 0 0 0
## 3647 0 0 0 0 0 0 0 0 0
## 3035 0 0 0 0 0 0 0 0 0
## 3069 0 0 0 0 0 0 0 0 0
## 1064 0 0 0 0 0 0 0 0 0
## 1061 0 0 0 0 0 0 0 0 0
## 1905 0 0 0 0 0 0 0 0 0
## 4615 0 0 0 0 0 0 0 0 0
## 4977 0 0 0 0 0 0 0 0 0
## 3621 0 0 0 0 0 0 0 0 0
## 4989 0 0 0 0 0 0 0 0 0
## 2621 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0
## 2978 0 0 0 0 0 0 0 0 0
## 4092 0 0 0 0 0 0 0 0 0
## 3674 0 0 0 0 0 0 0 0 0
## 2213 0 1 0 0 0 0 0 0 0
## 2618 0 0 0 0 0 0 0 0 0
## 2626 0 0 1 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0
## 1737 0 0 0 0 0 0 0 0 0
## 2989 0 0 0 0 0 0 0 0 0
## 4047 0 0 0 0 0 0 0 0 0
## 1741 0 0 0 0 0 0 0 0 0
## 2004 0 0 0 0 0 0 0 0 0
## 2798 0 0 0 0 0 0 0 0 0
## 2876 0 0 0 0 0 0 0 0 0
## 3510 0 0 0 0 0 0 0 1 0
## 1926 0 0 0 0 0 0 0 0 0
## 4481 0 0 0 0 0 0 0 0 0
## 4691 0 0 0 0 0 0 0 0 0
## 1138 0 0 0 0 0 0 0 0 0
## 3530 0 0 0 0 0 0 0 0 1
## 4401 0 0 0 0 0 0 0 0 0
## 2939 0 0 0 0 0 0 0 0 0
## 3075 0 0 0 0 0 0 0 0 0
## 4563 0 0 0 0 0 0 0 0 0
## 4139 0 0 0 0 0 0 0 0 0
## 2821 0 0 0 0 0 0 0 0 0
## 3996 0 0 0 0 0 0 0 0 0
## 554 0 0 0 0 0 0 0 0 0
## 3718 0 0 0 0 0 0 0 0 0
## 3032 0 0 0 0 0 0 0 0 0
## 722 0 0 0 0 0 0 0 0 0
## 391 0 0 0 0 0 0 0 0 0
## 2255 0 0 0 0 0 0 0 0 0
## 3786 0 0 0 0 0 0 0 0 0
## 3563 0 0 0 0 0 0 0 0 0
## 3968 0 0 0 0 0 0 0 0 0
## 826 0 0 0 0 0 0 0 0 0
## 4585 0 0 0 0 0 0 0 0 0
## 1425 0 0 0 0 0 0 0 0 0
## 724 0 0 0 0 0 0 0 0 0
## 3489 0 0 0 0 0 0 0 0 0
## 1572 0 0 0 0 0 0 0 0 0
## 3776 0 0 0 0 0 0 0 0 0
## 1912 0 0 0 0 0 1 0 0 0
## 3289 0 0 0 0 0 0 0 0 0
## 3759 0 0 0 0 0 0 0 0 0
## 911 0 0 0 0 0 0 0 0 0
## 141 0 0 0 0 0 0 0 0 1
## 658 1 0 0 0 0 0 0 0 0
## 3293 0 0 0 0 0 0 0 0 0
## 4525 1 0 0 0 0 0 0 0 0
## 2664 0 0 0 0 0 0 0 0 0
## 2912 0 0 0 0 0 0 0 0 0
## 953 0 0 0 0 0 0 0 0 0
## 2589 0 0 0 0 0 0 0 0 0
## 869 0 0 0 0 0 0 0 0 0
## 2185 0 0 0 0 0 0 0 0 0
## 1533 0 0 0 0 1 0 0 0 0
## 562 0 0 0 0 0 0 0 0 0
## 900 0 0 0 0 0 0 0 0 0
## 3525 0 0 0 0 0 0 0 0 0
## 1989 0 0 0 1 0 0 0 0 0
## 2000 0 0 0 0 0 0 0 0 0
## 2319 0 0 0 0 0 0 0 0 0
## 2064 0 0 0 0 0 0 0 0 0
## 659 0 0 0 0 0 0 0 0 0
## 3979 0 0 0 0 0 0 0 0 0
## 2857 0 0 0 0 0 0 0 0 0
## 3831 0 0 0 0 0 0 0 0 1
## 3708 0 0 0 0 0 0 0 0 0
## 4426 0 0 0 0 0 0 0 0 0
## 4158 0 0 0 0 0 0 0 0 0
## 1528 0 0 0 0 0 0 0 0 0
## 1249 0 0 0 0 0 0 0 0 0
## 3575 0 0 0 0 0 0 0 0 0
## 3599 0 0 0 0 0 0 0 0 0
## 4419 0 0 0 0 0 0 0 0 0
## 3818 0 0 0 0 0 0 0 0 0
## 642 0 0 0 0 0 0 0 0 0
## 1385 0 0 0 0 0 0 1 0 0
## 937 0 0 0 0 0 0 0 0 0
## 3771 0 0 0 0 0 0 0 0 0
## 620 0 0 0 0 0 0 0 0 0
## 621 0 0 0 0 0 0 0 0 0
## 348 0 0 0 0 0 0 0 0 0
## 256 0 0 0 0 0 0 0 0 0
## 2556 0 0 0 0 0 0 0 0 0
## 540 0 0 0 0 0 0 0 0 0
## 3569 0 0 0 0 0 0 0 0 0
## 3512 0 0 0 0 0 0 0 0 1
## 4249 0 0 0 0 0 0 0 0 0
## 2482 0 0 0 0 0 0 0 0 0
## 4088 0 0 0 0 0 0 0 0 0
## 2125 0 0 0 0 0 0 0 0 0
## 758 0 0 0 0 0 0 0 0 0
## 2121 0 0 0 0 0 0 0 0 0
## 4640 0 0 0 0 1 0 0 0 0
## 2323 0 0 0 0 0 0 0 0 0
## 1210 0 0 0 0 0 0 1 0 0
## 1245 0 0 0 0 0 0 0 0 0
## 2597 0 0 0 0 0 0 0 1 0
## 3113 0 0 1 0 0 0 0 0 0
## 1611 0 0 0 0 0 0 0 0 0
## 292 0 0 0 0 0 0 0 0 0
## 2160 0 0 0 0 0 0 0 0 0
## 4014 0 0 0 0 0 0 0 0 0
## 2750 0 1 0 0 0 0 0 0 0
## 1691 0 0 0 0 0 0 0 0 0
## 4886 0 0 0 0 0 0 0 0 0
## 4269 0 0 0 1 0 0 0 0 0
## 2343 0 0 0 0 0 0 0 0 0
## 821 0 0 0 0 0 0 0 0 0
## 2595 0 0 0 0 0 0 0 0 0
## 4593 0 0 0 0 0 0 0 0 0
## 4911 0 0 0 0 0 0 0 0 0
## 3918 0 0 0 0 0 0 0 0 0
## 1466 0 0 0 0 0 1 0 0 0
## 886 0 0 0 0 0 0 0 0 0
## 231 0 0 0 0 0 0 0 0 0
## 1173 0 0 0 0 0 0 0 0 0
## 1675 0 0 0 0 0 0 0 0 0
## 759 0 0 0 0 0 0 0 0 0
## 1450 0 0 0 0 0 1 0 0 0
## 84 0 0 0 0 0 0 0 0 0
## 4750 0 0 0 0 0 0 0 0 0
## 3833 0 0 0 0 0 0 0 0 0
## 413 0 0 0 0 0 0 0 0 0
## 4144 0 0 0 0 0 0 0 0 0
## 2641 0 0 0 0 0 0 0 0 0
## 2007 0 0 1 0 0 0 0 0 0
## 322 0 0 0 0 0 0 0 0 0
## 2672 0 0 0 0 0 0 0 0 0
## 337 0 0 0 0 0 0 0 0 0
## 1006 0 0 0 0 0 0 0 0 0
## 2614 0 0 0 0 0 0 0 0 0
## 2292 0 0 0 0 0 0 0 0 0
## 4769 0 0 0 0 0 0 0 0 0
## 711 0 0 0 0 0 0 0 0 0
## 2373 0 0 0 0 0 0 0 0 0
## 4469 0 0 0 0 0 0 0 0 0
## stateFL stateGA stateHI stateIA stateID stateIL stateIN stateKS stateKY
## 4575 0 0 0 0 0 0 0 0 0
## 4685 0 0 0 0 0 0 0 0 0
## 1431 0 0 0 0 0 0 0 0 0
## 4150 0 0 0 0 0 0 0 0 0
## 3207 0 0 0 0 0 0 0 0 0
## 2593 0 0 0 0 0 0 0 0 0
## 3679 0 0 0 0 0 0 0 0 0
## 673 0 0 0 0 0 1 0 0 0
## 3280 0 0 0 0 0 0 0 0 0
## 3519 0 0 0 0 0 0 0 0 0
## 2285 0 0 0 0 0 0 0 0 0
## 3588 0 0 0 0 0 0 0 0 0
## 4663 0 0 0 0 0 1 0 0 0
## 1274 0 0 0 0 0 0 0 0 0
## 2305 0 0 0 0 0 0 0 0 0
## 4686 0 0 0 0 0 0 0 0 0
## 4876 0 0 0 0 0 0 0 0 0
## 586 0 0 0 0 0 0 0 0 0
## 2367 0 0 0 0 0 0 0 0 0
## 2792 0 0 0 0 1 0 0 0 0
## 4503 0 0 0 0 0 0 0 0 0
## 691 0 0 0 0 0 0 0 1 0
## 4923 0 0 0 0 0 0 0 0 0
## 4712 0 0 0 0 0 0 0 0 0
## 411 0 0 0 0 0 0 0 0 0
## 2559 0 0 0 0 0 0 0 0 0
## 1941 0 0 0 0 0 0 0 0 0
## 4505 0 0 0 0 0 0 0 0 0
## 2223 0 0 0 0 0 0 0 0 0
## 4156 0 0 0 0 0 0 0 0 0
## 3666 0 0 0 0 0 0 0 0 0
## 4031 0 0 0 0 1 0 0 0 0
## 1929 0 0 0 0 0 0 0 0 0
## 3404 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0
## 4136 0 0 0 0 0 0 0 1 0
## 37 0 0 0 0 0 0 0 0 0
## 1031 0 0 0 0 0 0 0 0 0
## 4499 0 0 0 0 0 0 0 0 0
## 3036 0 0 0 0 0 0 0 0 0
## 1883 0 0 1 0 0 0 0 0 0
## 2161 0 0 0 0 0 0 0 0 0
## 186 0 0 0 0 0 0 0 0 0
## 4826 0 0 0 1 0 0 0 0 0
## 2140 0 0 0 0 0 0 0 0 0
## 4745 0 0 0 0 1 0 0 0 0
## 4398 0 0 0 0 0 0 0 0 0
## 3170 0 0 0 0 1 0 0 0 0
## 4809 0 0 0 0 1 0 0 0 0
## 3064 0 0 0 0 0 0 0 0 0
## 1651 0 0 0 0 0 0 0 0 0
## 1717 0 0 0 0 0 0 0 0 0
## 1972 0 0 0 0 0 0 0 1 0
## 3882 0 1 0 0 0 0 0 0 0
## 193 0 0 0 0 0 0 0 0 1
## 3703 0 0 0 0 0 0 0 0 0
## 3349 0 0 0 0 0 0 0 0 0
## 847 0 0 0 0 0 0 0 0 0
## 1291 0 0 0 0 0 0 0 0 0
## 2542 0 0 0 0 0 0 0 0 1
## 3338 0 0 0 0 0 0 0 0 0
## 4855 0 0 0 0 0 0 0 0 0
## 3751 0 0 0 0 0 0 0 0 0
## 2797 0 0 1 0 0 0 0 0 0
## 4195 0 0 0 0 0 0 0 0 0
## 936 0 0 0 0 0 0 0 0 0
## 1339 0 0 0 0 0 0 0 0 0
## 4086 0 0 0 0 0 0 0 0 0
## 3419 0 0 0 0 0 0 0 0 0
## 1187 0 0 0 0 0 0 0 0 0
## 212 0 0 0 0 0 0 0 0 0
## 693 0 0 0 0 0 0 0 0 0
## 1067 0 0 0 0 0 0 0 1 0
## 2362 0 0 0 0 0 0 0 0 0
## 973 0 0 0 0 0 0 0 0 0
## 3543 0 0 0 0 0 0 0 0 0
## 39 0 0 0 0 0 0 0 0 0
## 1849 1 0 0 0 0 0 0 0 0
## 2532 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 2862 0 0 0 0 0 0 0 0 0
## 777 0 0 0 0 0 0 0 0 0
## 1766 0 0 0 0 0 0 0 0 0
## 3175 0 0 0 0 0 0 0 0 0
## 3814 0 0 0 0 0 0 0 0 0
## 2771 0 0 0 0 0 0 0 0 0
## 1149 0 0 0 0 0 0 0 0 0
## 443 0 0 0 0 0 0 0 0 0
## 421 0 0 0 0 0 0 0 0 0
## 1499 0 0 0 0 0 0 0 0 0
## 3278 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 1024 0 0 0 0 0 0 0 0 0
## 4579 0 0 0 0 0 0 0 0 0
## 4542 0 0 0 0 0 0 0 0 0
## 3601 0 0 0 0 0 0 0 0 0
## 1634 0 0 0 0 0 0 0 0 0
## 2526 0 0 0 0 0 0 0 0 0
## 3647 0 0 0 0 0 0 0 0 0
## 3035 0 0 0 0 0 0 0 0 0
## 3069 0 0 0 0 0 0 0 0 0
## 1064 0 0 1 0 0 0 0 0 0
## 1061 0 0 0 0 0 0 0 0 0
## 1905 0 0 0 0 0 0 0 0 0
## 4615 0 0 0 0 0 0 0 0 0
## 4977 0 0 0 0 0 0 0 0 1
## 3621 0 0 0 0 0 0 0 0 0
## 4989 0 0 0 0 0 0 0 0 0
## 2621 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0
## 2978 0 0 0 0 0 0 0 0 0
## 4092 0 0 0 0 0 0 0 0 0
## 3674 0 0 0 0 0 0 0 0 0
## 2213 0 0 0 0 0 0 0 0 0
## 2618 0 0 0 0 0 0 0 0 0
## 2626 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0
## 1737 0 0 0 0 0 0 0 0 0
## 2989 0 0 0 0 0 0 0 0 0
## 4047 0 0 0 0 0 0 0 0 0
## 1741 0 0 0 0 0 0 0 0 0
## 2004 0 0 0 0 0 0 0 0 0
## 2798 0 0 0 0 0 0 0 0 0
## 2876 0 0 0 0 0 0 0 0 0
## 3510 0 0 0 0 0 0 0 0 0
## 1926 0 0 0 0 0 0 0 0 0
## 4481 0 0 0 0 0 0 0 0 0
## 4691 0 0 0 0 0 0 0 0 0
## 1138 0 0 0 0 0 0 0 0 0
## 3530 0 0 0 0 0 0 0 0 0
## 4401 0 0 0 0 0 0 0 0 0
## 2939 0 0 0 0 0 0 0 0 0
## 3075 0 0 0 0 0 0 0 0 0
## 4563 0 0 0 0 0 0 0 0 0
## 4139 0 0 0 0 0 0 0 0 1
## 2821 0 0 0 0 0 1 0 0 0
## 3996 0 0 0 0 0 0 0 0 1
## 554 0 0 0 0 0 0 0 0 0
## 3718 1 0 0 0 0 0 0 0 0
## 3032 0 0 0 0 0 0 0 0 1
## 722 0 0 0 0 0 0 0 0 0
## 391 0 0 0 0 0 0 0 0 0
## 2255 1 0 0 0 0 0 0 0 0
## 3786 0 0 0 0 0 0 0 0 0
## 3563 0 0 0 0 0 0 0 0 0
## 3968 0 0 0 0 0 0 0 0 0
## 826 0 0 0 0 0 0 0 0 0
## 4585 0 0 0 0 0 0 0 0 0
## 1425 0 0 0 0 0 0 0 0 0
## 724 0 0 0 0 0 0 0 0 0
## 3489 0 0 0 0 0 0 0 0 1
## 1572 0 0 0 0 0 1 0 0 0
## 3776 0 0 0 0 0 0 0 0 0
## 1912 0 0 0 0 0 0 0 0 0
## 3289 0 0 0 0 0 0 0 0 0
## 3759 0 0 0 0 1 0 0 0 0
## 911 0 0 0 0 0 0 0 0 0
## 141 0 0 0 0 0 0 0 0 0
## 658 0 0 0 0 0 0 0 0 0
## 3293 0 0 0 0 0 0 1 0 0
## 4525 0 0 0 0 0 0 0 0 0
## 2664 0 0 0 0 0 0 0 0 0
## 2912 0 0 0 0 0 0 0 0 0
## 953 0 0 0 0 0 0 0 0 0
## 2589 0 0 0 0 0 0 0 0 0
## 869 0 0 0 0 0 0 0 0 0
## 2185 0 0 0 0 1 0 0 0 0
## 1533 0 0 0 0 0 0 0 0 0
## 562 0 0 0 0 0 0 0 0 0
## 900 0 0 0 0 0 0 0 0 0
## 3525 0 0 0 0 0 0 0 0 0
## 1989 0 0 0 0 0 0 0 0 0
## 2000 0 0 0 0 0 0 0 0 0
## 2319 0 0 0 0 0 0 0 0 0
## 2064 0 0 0 0 0 0 0 0 0
## 659 0 0 0 0 0 0 0 0 0
## 3979 0 1 0 0 0 0 0 0 0
## 2857 0 0 0 1 0 0 0 0 0
## 3831 0 0 0 0 0 0 0 0 0
## 3708 0 0 0 0 0 0 0 0 0
## 4426 0 0 0 0 0 0 0 0 0
## 4158 0 0 0 0 0 0 0 0 0
## 1528 0 0 0 1 0 0 0 0 0
## 1249 0 0 0 0 0 0 0 0 0
## 3575 0 0 0 0 0 0 0 0 0
## 3599 0 0 0 0 0 0 0 0 0
## 4419 0 0 0 0 0 0 0 0 0
## 3818 0 0 0 0 1 0 0 0 0
## 642 0 0 0 0 0 0 0 0 0
## 1385 0 0 0 0 0 0 0 0 0
## 937 0 0 0 0 0 0 0 0 0
## 3771 0 0 0 0 0 0 0 0 0
## 620 0 0 0 0 0 0 0 1 0
## 621 0 0 0 0 0 0 0 1 0
## 348 0 0 0 0 0 0 0 0 0
## 256 1 0 0 0 0 0 0 0 0
## 2556 0 0 0 0 0 0 0 0 0
## 540 0 0 0 0 0 0 0 0 0
## 3569 0 0 0 0 0 0 0 1 0
## 3512 0 0 0 0 0 0 0 0 0
## 4249 0 0 0 0 0 0 0 0 0
## 2482 0 0 0 0 0 0 0 0 0
## 4088 0 0 0 0 0 0 0 0 0
## 2125 0 0 0 0 0 0 0 1 0
## 758 0 0 0 0 0 0 0 0 0
## 2121 0 0 0 0 0 0 0 0 0
## 4640 0 0 0 0 0 0 0 0 0
## 2323 0 1 0 0 0 0 0 0 0
## 1210 0 0 0 0 0 0 0 0 0
## 1245 0 0 0 0 0 0 0 0 0
## 2597 0 0 0 0 0 0 0 0 0
## 3113 0 0 0 0 0 0 0 0 0
## 1611 0 0 0 0 0 0 0 0 0
## 292 0 0 0 0 0 0 0 0 0
## 2160 0 0 0 0 0 0 0 1 0
## 4014 0 0 0 0 0 0 0 0 0
## 2750 0 0 0 0 0 0 0 0 0
## 1691 0 0 0 0 0 0 0 0 0
## 4886 0 0 0 0 0 0 0 0 0
## 4269 0 0 0 0 0 0 0 0 0
## 2343 0 0 0 0 0 0 0 0 0
## 821 0 0 0 0 0 0 0 0 0
## 2595 0 0 0 0 0 0 0 0 0
## 4593 0 0 0 0 0 0 0 0 0
## 4911 0 0 0 0 0 0 0 0 0
## 3918 0 0 0 0 0 0 0 0 0
## 1466 0 0 0 0 0 0 0 0 0
## 886 0 0 0 0 0 0 0 0 0
## 231 0 0 0 0 0 0 0 0 0
## 1173 0 0 0 0 0 0 0 0 0
## 1675 0 0 0 0 0 0 0 0 0
## 759 0 0 0 0 0 0 0 0 0
## 1450 0 0 0 0 0 0 0 0 0
## 84 0 1 0 0 0 0 0 0 0
## 4750 0 0 0 0 0 0 0 0 0
## 3833 0 0 0 0 0 0 0 0 0
## 413 0 0 0 0 0 0 0 0 0
## 4144 0 0 0 0 0 0 0 0 0
## 2641 0 0 0 0 0 0 0 0 1
## 2007 0 0 0 0 0 0 0 0 0
## 322 0 0 0 0 0 0 0 0 0
## 2672 0 0 0 0 0 0 0 0 0
## 337 0 0 0 0 0 0 0 0 0
## 1006 0 0 0 0 0 0 0 0 0
## 2614 0 0 0 0 0 0 0 0 0
## 2292 0 0 0 0 0 0 0 0 0
## 4769 0 0 0 0 0 0 0 0 0
## 711 0 0 0 0 0 0 0 0 0
## 2373 0 0 0 0 0 0 0 0 0
## 4469 0 0 0 0 1 0 0 0 0
## stateLA stateMA stateMD stateME stateMI stateMN stateMO stateMS stateMT
## 4575 0 0 0 0 0 0 0 0 0
## 4685 0 0 0 0 0 0 1 0 0
## 1431 0 0 0 0 0 0 0 0 0
## 4150 0 0 0 0 0 0 0 0 0
## 3207 0 0 0 0 0 0 0 0 0
## 2593 0 0 0 0 0 0 0 0 0
## 3679 0 0 0 0 0 0 0 0 0
## 673 0 0 0 0 0 0 0 0 0
## 3280 0 0 0 0 0 0 0 0 0
## 3519 0 0 0 0 0 0 0 0 0
## 2285 0 0 0 0 0 0 0 0 0
## 3588 0 0 0 0 0 0 0 0 0
## 4663 0 0 0 0 0 0 0 0 0
## 1274 0 0 0 0 0 0 0 0 1
## 2305 0 0 0 0 0 0 0 0 0
## 4686 0 0 0 0 0 0 0 0 0
## 4876 0 0 0 0 0 0 0 0 0
## 586 0 0 0 0 0 0 0 0 0
## 2367 0 0 0 0 0 0 0 1 0
## 2792 0 0 0 0 0 0 0 0 0
## 4503 0 0 0 0 0 0 0 0 0
## 691 0 0 0 0 0 0 0 0 0
## 4923 0 0 0 0 0 1 0 0 0
## 4712 0 0 0 0 0 0 0 0 0
## 411 0 0 0 0 0 0 0 0 0
## 2559 0 0 0 0 0 0 0 0 0
## 1941 0 0 0 0 0 0 0 0 0
## 4505 0 0 0 0 0 0 0 0 0
## 2223 0 0 0 0 0 0 0 0 0
## 4156 0 0 0 0 0 0 0 0 0
## 3666 0 0 0 0 0 0 0 0 0
## 4031 0 0 0 0 0 0 0 0 0
## 1929 0 0 0 0 0 0 0 0 0
## 3404 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0
## 4136 0 0 0 0 0 0 0 0 0
## 37 0 0 0 0 0 0 0 0 0
## 1031 0 0 0 0 0 0 0 0 0
## 4499 0 0 0 0 0 0 0 0 0
## 3036 0 0 0 1 0 0 0 0 0
## 1883 0 0 0 0 0 0 0 0 0
## 2161 0 0 0 0 0 0 0 0 0
## 186 0 0 0 0 0 0 0 0 0
## 4826 0 0 0 0 0 0 0 0 0
## 2140 0 0 0 0 0 0 0 0 0
## 4745 0 0 0 0 0 0 0 0 0
## 4398 0 0 0 0 0 0 0 0 0
## 3170 0 0 0 0 0 0 0 0 0
## 4809 0 0 0 0 0 0 0 0 0
## 3064 0 0 0 0 0 0 0 0 0
## 1651 0 0 0 0 0 0 0 0 0
## 1717 0 0 0 1 0 0 0 0 0
## 1972 0 0 0 0 0 0 0 0 0
## 3882 0 0 0 0 0 0 0 0 0
## 193 0 0 0 0 0 0 0 0 0
## 3703 0 0 0 0 0 0 0 0 0
## 3349 0 0 0 0 0 1 0 0 0
## 847 0 0 0 0 0 0 0 0 0
## 1291 0 0 0 0 0 0 0 0 0
## 2542 0 0 0 0 0 0 0 0 0
## 3338 0 0 0 0 0 0 0 0 0
## 4855 0 0 0 0 0 0 0 0 0
## 3751 0 0 0 0 0 0 0 0 0
## 2797 0 0 0 0 0 0 0 0 0
## 4195 0 0 0 0 0 0 1 0 0
## 936 0 0 1 0 0 0 0 0 0
## 1339 0 0 0 0 0 0 0 0 0
## 4086 0 0 0 0 0 0 0 0 0
## 3419 0 0 0 0 0 0 0 0 0
## 1187 0 0 0 0 0 0 0 0 0
## 212 0 0 0 0 0 0 0 0 0
## 693 0 0 0 0 0 0 0 0 0
## 1067 0 0 0 0 0 0 0 0 0
## 2362 0 1 0 0 0 0 0 0 0
## 973 1 0 0 0 0 0 0 0 0
## 3543 0 0 0 0 0 1 0 0 0
## 39 0 0 0 0 0 0 0 0 0
## 1849 0 0 0 0 0 0 0 0 0
## 2532 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 1 0 0
## 2862 0 0 0 0 0 0 0 0 0
## 777 0 0 0 0 0 0 0 0 0
## 1766 0 0 0 0 0 0 0 0 0
## 3175 0 0 0 0 0 0 0 0 0
## 3814 0 0 0 0 0 0 0 0 0
## 2771 0 0 0 0 0 0 0 0 0
## 1149 0 0 0 0 0 0 0 0 0
## 443 0 0 0 0 0 0 0 0 0
## 421 0 0 0 0 0 0 0 0 0
## 1499 0 0 0 0 0 0 0 0 0
## 3278 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 1024 0 0 0 0 0 0 0 0 0
## 4579 0 0 0 0 0 0 0 0 0
## 4542 1 0 0 0 0 0 0 0 0
## 3601 0 0 0 0 0 0 1 0 0
## 1634 0 0 0 0 0 0 0 0 0
## 2526 0 0 0 0 0 0 0 0 0
## 3647 0 0 0 0 0 1 0 0 0
## 3035 0 0 0 0 0 0 0 0 0
## 3069 0 0 0 0 0 0 0 0 0
## 1064 0 0 0 0 0 0 0 0 0
## 1061 0 0 0 0 0 0 0 0 0
## 1905 0 0 0 0 0 0 0 0 0
## 4615 0 0 0 0 0 0 0 0 0
## 4977 0 0 0 0 0 0 0 0 0
## 3621 0 0 0 0 0 0 0 0 0
## 4989 0 0 0 0 0 0 0 0 0
## 2621 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0
## 2978 0 0 0 0 0 0 0 0 1
## 4092 0 0 0 0 0 0 0 0 1
## 3674 0 0 0 0 0 0 0 0 0
## 2213 0 0 0 0 0 0 0 0 0
## 2618 0 0 0 0 0 0 0 0 0
## 2626 0 0 0 0 0 0 0 0 0
## 7 0 1 0 0 0 0 0 0 0
## 1737 0 0 0 0 0 0 0 0 0
## 2989 0 0 0 0 0 0 0 0 0
## 4047 0 0 0 0 0 0 0 0 1
## 1741 0 0 0 0 0 0 0 0 0
## 2004 0 0 0 0 0 0 0 0 0
## 2798 0 0 0 0 0 0 0 0 0
## 2876 1 0 0 0 0 0 0 0 0
## 3510 0 0 0 0 0 0 0 0 0
## 1926 0 0 0 0 0 0 0 0 0
## 4481 0 0 0 0 0 0 0 0 0
## 4691 0 0 0 0 0 1 0 0 0
## 1138 0 0 0 0 0 0 0 0 0
## 3530 0 0 0 0 0 0 0 0 0
## 4401 0 0 0 0 0 0 0 0 0
## 2939 0 0 0 0 0 0 0 0 0
## 3075 0 0 0 0 0 0 0 0 0
## 4563 0 0 0 0 0 0 0 0 0
## 4139 0 0 0 0 0 0 0 0 0
## 2821 0 0 0 0 0 0 0 0 0
## 3996 0 0 0 0 0 0 0 0 0
## 554 0 0 0 0 0 0 0 0 0
## 3718 0 0 0 0 0 0 0 0 0
## 3032 0 0 0 0 0 0 0 0 0
## 722 0 0 0 0 0 0 0 0 0
## 391 0 0 0 0 0 0 0 0 0
## 2255 0 0 0 0 0 0 0 0 0
## 3786 0 0 0 0 0 0 0 0 0
## 3563 0 0 0 0 0 0 0 0 0
## 3968 0 0 0 0 0 0 0 0 0
## 826 0 0 0 0 0 0 1 0 0
## 4585 0 0 0 0 0 0 0 0 0
## 1425 0 0 0 0 0 0 0 0 0
## 724 0 0 0 0 1 0 0 0 0
## 3489 0 0 0 0 0 0 0 0 0
## 1572 0 0 0 0 0 0 0 0 0
## 3776 0 0 0 0 0 0 0 0 0
## 1912 0 0 0 0 0 0 0 0 0
## 3289 0 0 0 0 0 0 0 0 0
## 3759 0 0 0 0 0 0 0 0 0
## 911 0 0 0 0 0 0 0 0 0
## 141 0 0 0 0 0 0 0 0 0
## 658 0 0 0 0 0 0 0 0 0
## 3293 0 0 0 0 0 0 0 0 0
## 4525 0 0 0 0 0 0 0 0 0
## 2664 0 0 0 0 0 0 0 0 0
## 2912 0 0 0 0 0 0 0 0 0
## 953 0 0 0 0 0 0 0 0 0
## 2589 0 0 0 0 1 0 0 0 0
## 869 0 0 1 0 0 0 0 0 0
## 2185 0 0 0 0 0 0 0 0 0
## 1533 0 0 0 0 0 0 0 0 0
## 562 0 0 0 0 0 0 0 0 0
## 900 0 0 0 0 0 0 0 0 0
## 3525 0 0 0 0 0 0 0 0 0
## 1989 0 0 0 0 0 0 0 0 0
## 2000 0 0 0 0 0 0 0 0 0
## 2319 0 0 0 0 0 0 0 0 0
## 2064 0 0 1 0 0 0 0 0 0
## 659 0 0 0 0 0 0 0 0 0
## 3979 0 0 0 0 0 0 0 0 0
## 2857 0 0 0 0 0 0 0 0 0
## 3831 0 0 0 0 0 0 0 0 0
## 3708 0 0 0 0 0 0 0 0 0
## 4426 0 0 0 0 0 1 0 0 0
## 4158 0 0 0 0 0 0 0 0 0
## 1528 0 0 0 0 0 0 0 0 0
## 1249 0 0 0 0 0 0 0 0 0
## 3575 0 0 0 0 0 0 0 0 0
## 3599 0 0 0 0 0 0 0 0 0
## 4419 0 0 0 0 0 0 0 0 0
## 3818 0 0 0 0 0 0 0 0 0
## 642 0 0 0 0 0 0 0 0 0
## 1385 0 0 0 0 0 0 0 0 0
## 937 0 0 0 0 0 0 0 0 0
## 3771 0 0 0 0 0 0 0 0 0
## 620 0 0 0 0 0 0 0 0 0
## 621 0 0 0 0 0 0 0 0 0
## 348 0 0 0 0 0 0 0 0 0
## 256 0 0 0 0 0 0 0 0 0
## 2556 0 0 0 0 0 0 0 0 0
## 540 0 0 0 0 0 0 0 0 0
## 3569 0 0 0 0 0 0 0 0 0
## 3512 0 0 0 0 0 0 0 0 0
## 4249 0 0 0 0 0 0 1 0 0
## 2482 0 0 0 0 0 0 0 0 0
## 4088 0 0 0 0 0 0 0 0 0
## 2125 0 0 0 0 0 0 0 0 0
## 758 0 0 0 0 0 0 0 0 0
## 2121 0 0 0 0 0 0 0 0 0
## 4640 0 0 0 0 0 0 0 0 0
## 2323 0 0 0 0 0 0 0 0 0
## 1210 0 0 0 0 0 0 0 0 0
## 1245 0 0 0 0 0 0 0 0 0
## 2597 0 0 0 0 0 0 0 0 0
## 3113 0 0 0 0 0 0 0 0 0
## 1611 0 0 0 0 0 0 0 0 0
## 292 0 0 0 0 0 0 0 0 0
## 2160 0 0 0 0 0 0 0 0 0
## 4014 0 0 0 0 0 0 0 0 0
## 2750 0 0 0 0 0 0 0 0 0
## 1691 0 0 0 0 0 0 0 0 0
## 4886 0 0 0 1 0 0 0 0 0
## 4269 0 0 0 0 0 0 0 0 0
## 2343 0 0 0 0 1 0 0 0 0
## 821 0 0 0 0 0 0 0 0 0
## 2595 0 0 0 0 0 0 0 0 0
## 4593 0 0 0 0 0 0 0 0 0
## 4911 0 0 0 0 0 0 0 0 0
## 3918 0 0 0 0 0 0 0 0 0
## 1466 0 0 0 0 0 0 0 0 0
## 886 0 0 0 1 0 0 0 0 0
## 231 0 0 1 0 0 0 0 0 0
## 1173 0 1 0 0 0 0 0 0 0
## 1675 0 0 0 0 0 0 0 0 0
## 759 1 0 0 0 0 0 0 0 0
## 1450 0 0 0 0 0 0 0 0 0
## 84 0 0 0 0 0 0 0 0 0
## 4750 0 0 0 0 0 0 0 0 0
## 3833 0 0 0 0 0 0 0 0 0
## 413 0 0 0 0 0 0 0 0 0
## 4144 0 0 0 0 0 0 0 0 0
## 2641 0 0 0 0 0 0 0 0 0
## 2007 0 0 0 0 0 0 0 0 0
## 322 0 0 0 0 0 0 0 0 0
## 2672 0 0 0 0 0 0 0 0 0
## 337 0 0 0 0 0 0 0 0 0
## 1006 0 0 0 0 0 0 0 0 0
## 2614 0 0 0 0 1 0 0 0 0
## 2292 0 0 0 0 0 0 0 0 0
## 4769 0 0 0 1 0 0 0 0 0
## 711 0 0 0 0 0 0 0 0 0
## 2373 0 0 0 0 0 0 0 0 0
## 4469 0 0 0 0 0 0 0 0 0
## stateNC stateND stateNE stateNH stateNJ stateNM stateNV stateNY stateOH
## 4575 0 0 0 0 0 0 0 0 0
## 4685 0 0 0 0 0 0 0 0 0
## 1431 0 0 0 0 0 0 0 0 0
## 4150 0 0 0 0 0 0 0 0 0
## 3207 0 0 0 0 0 0 0 0 0
## 2593 0 0 0 0 0 0 0 0 0
## 3679 0 0 0 0 0 0 1 0 0
## 673 0 0 0 0 0 0 0 0 0
## 3280 0 0 0 0 1 0 0 0 0
## 3519 0 0 0 0 0 0 0 0 0
## 2285 0 0 0 0 0 0 0 0 0
## 3588 0 0 0 0 0 0 0 0 0
## 4663 0 0 0 0 0 0 0 0 0
## 1274 0 0 0 0 0 0 0 0 0
## 2305 0 0 0 0 0 0 0 0 0
## 4686 0 0 0 0 0 0 0 0 0
## 4876 0 0 0 0 0 0 0 0 0
## 586 0 0 0 0 0 0 0 0 0
## 2367 0 0 0 0 0 0 0 0 0
## 2792 0 0 0 0 0 0 0 0 0
## 4503 0 0 0 0 0 0 0 0 0
## 691 0 0 0 0 0 0 0 0 0
## 4923 0 0 0 0 0 0 0 0 0
## 4712 0 0 0 0 0 0 0 0 0
## 411 0 0 0 0 0 0 0 0 0
## 2559 0 0 0 0 0 0 0 0 0
## 1941 0 0 0 0 0 0 0 0 0
## 4505 0 0 0 0 0 0 0 0 0
## 2223 0 0 0 0 0 0 0 0 0
## 4156 0 0 0 0 0 0 0 0 0
## 3666 0 0 0 0 0 0 0 0 0
## 4031 0 0 0 0 0 0 0 0 0
## 1929 0 0 0 0 0 0 0 0 0
## 3404 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0
## 4136 0 0 0 0 0 0 0 0 0
## 37 0 0 0 0 0 0 0 0 0
## 1031 0 0 0 0 0 0 0 0 0
## 4499 0 0 0 0 0 0 0 1 0
## 3036 0 0 0 0 0 0 0 0 0
## 1883 0 0 0 0 0 0 0 0 0
## 2161 0 0 0 0 0 0 1 0 0
## 186 0 0 0 0 0 1 0 0 0
## 4826 0 0 0 0 0 0 0 0 0
## 2140 0 0 0 0 0 0 0 0 0
## 4745 0 0 0 0 0 0 0 0 0
## 4398 0 0 0 0 0 0 0 0 1
## 3170 0 0 0 0 0 0 0 0 0
## 4809 0 0 0 0 0 0 0 0 0
## 3064 0 0 0 0 0 0 0 0 0
## 1651 0 0 0 0 0 0 0 0 0
## 1717 0 0 0 0 0 0 0 0 0
## 1972 0 0 0 0 0 0 0 0 0
## 3882 0 0 0 0 0 0 0 0 0
## 193 0 0 0 0 0 0 0 0 0
## 3703 0 0 0 0 0 0 0 0 0
## 3349 0 0 0 0 0 0 0 0 0
## 847 0 0 1 0 0 0 0 0 0
## 1291 0 0 0 0 0 0 0 0 0
## 2542 0 0 0 0 0 0 0 0 0
## 3338 0 0 0 0 0 0 0 0 0
## 4855 0 0 0 0 0 0 0 0 0
## 3751 0 1 0 0 0 0 0 0 0
## 2797 0 0 0 0 0 0 0 0 0
## 4195 0 0 0 0 0 0 0 0 0
## 936 0 0 0 0 0 0 0 0 0
## 1339 0 0 0 0 0 0 0 0 0
## 4086 0 0 0 0 0 0 0 0 0
## 3419 0 0 0 0 0 0 0 0 0
## 1187 0 0 0 0 0 0 0 0 0
## 212 0 0 0 0 0 0 0 0 0
## 693 0 0 1 0 0 0 0 0 0
## 1067 0 0 0 0 0 0 0 0 0
## 2362 0 0 0 0 0 0 0 0 0
## 973 0 0 0 0 0 0 0 0 0
## 3543 0 0 0 0 0 0 0 0 0
## 39 0 0 0 0 0 0 0 0 0
## 1849 0 0 0 0 0 0 0 0 0
## 2532 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 2862 0 0 0 0 0 0 0 0 0
## 777 0 0 0 0 0 0 0 0 0
## 1766 0 0 0 0 0 1 0 0 0
## 3175 0 0 0 0 0 0 0 0 0
## 3814 0 0 0 0 0 0 0 0 0
## 2771 0 0 0 1 0 0 0 0 0
## 1149 0 0 0 0 0 0 0 1 0
## 443 0 0 0 0 0 0 0 0 0
## 421 0 0 0 0 0 1 0 0 0
## 1499 0 0 0 1 0 0 0 0 0
## 3278 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 1
## 1024 0 0 0 0 0 0 0 0 0
## 4579 0 0 0 0 0 0 0 0 0
## 4542 0 0 0 0 0 0 0 0 0
## 3601 0 0 0 0 0 0 0 0 0
## 1634 0 0 0 0 0 0 0 1 0
## 2526 0 0 0 0 0 0 0 0 0
## 3647 0 0 0 0 0 0 0 0 0
## 3035 0 0 0 0 1 0 0 0 0
## 3069 0 0 0 0 0 0 0 0 0
## 1064 0 0 0 0 0 0 0 0 0
## 1061 0 0 0 0 0 0 0 0 1
## 1905 0 0 0 0 0 0 0 0 0
## 4615 0 0 0 0 0 0 0 0 0
## 4977 0 0 0 0 0 0 0 0 0
## 3621 0 0 0 0 0 0 0 0 1
## 4989 0 0 0 0 0 0 0 0 0
## 2621 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0
## 2978 0 0 0 0 0 0 0 0 0
## 4092 0 0 0 0 0 0 0 0 0
## 3674 0 0 0 1 0 0 0 0 0
## 2213 0 0 0 0 0 0 0 0 0
## 2618 0 0 0 0 0 0 0 0 0
## 2626 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0
## 1737 0 0 0 0 0 0 1 0 0
## 2989 0 0 0 0 0 0 0 0 0
## 4047 0 0 0 0 0 0 0 0 0
## 1741 0 1 0 0 0 0 0 0 0
## 2004 0 0 0 0 0 0 0 0 0
## 2798 1 0 0 0 0 0 0 0 0
## 2876 0 0 0 0 0 0 0 0 0
## 3510 0 0 0 0 0 0 0 0 0
## 1926 0 0 0 0 1 0 0 0 0
## 4481 0 0 0 0 0 0 0 1 0
## 4691 0 0 0 0 0 0 0 0 0
## 1138 0 0 0 0 0 0 0 0 0
## 3530 0 0 0 0 0 0 0 0 0
## 4401 0 0 0 0 0 0 0 0 0
## 2939 0 0 1 0 0 0 0 0 0
## 3075 0 0 0 0 0 0 0 0 0
## 4563 0 0 0 0 0 1 0 0 0
## 4139 0 0 0 0 0 0 0 0 0
## 2821 0 0 0 0 0 0 0 0 0
## 3996 0 0 0 0 0 0 0 0 0
## 554 0 0 0 0 0 0 0 0 0
## 3718 0 0 0 0 0 0 0 0 0
## 3032 0 0 0 0 0 0 0 0 0
## 722 0 0 0 0 0 0 0 0 0
## 391 0 0 0 0 0 0 0 0 0
## 2255 0 0 0 0 0 0 0 0 0
## 3786 0 0 0 0 1 0 0 0 0
## 3563 0 0 0 0 0 0 0 0 0
## 3968 0 0 0 0 0 0 0 0 0
## 826 0 0 0 0 0 0 0 0 0
## 4585 0 0 0 0 0 0 0 0 0
## 1425 0 0 0 0 0 0 0 0 0
## 724 0 0 0 0 0 0 0 0 0
## 3489 0 0 0 0 0 0 0 0 0
## 1572 0 0 0 0 0 0 0 0 0
## 3776 0 0 0 0 1 0 0 0 0
## 1912 0 0 0 0 0 0 0 0 0
## 3289 0 0 0 0 0 0 0 0 0
## 3759 0 0 0 0 0 0 0 0 0
## 911 0 0 0 0 1 0 0 0 0
## 141 0 0 0 0 0 0 0 0 0
## 658 0 0 0 0 0 0 0 0 0
## 3293 0 0 0 0 0 0 0 0 0
## 4525 0 0 0 0 0 0 0 0 0
## 2664 0 0 0 0 0 0 0 0 0
## 2912 0 0 0 0 0 1 0 0 0
## 953 0 0 0 0 0 0 0 0 0
## 2589 0 0 0 0 0 0 0 0 0
## 869 0 0 0 0 0 0 0 0 0
## 2185 0 0 0 0 0 0 0 0 0
## 1533 0 0 0 0 0 0 0 0 0
## 562 0 0 0 0 0 0 0 0 0
## 900 0 0 0 0 0 0 0 0 0
## 3525 0 0 0 0 0 0 0 0 0
## 1989 0 0 0 0 0 0 0 0 0
## 2000 0 0 0 0 0 0 0 0 0
## 2319 0 0 0 0 0 0 0 0 0
## 2064 0 0 0 0 0 0 0 0 0
## 659 0 0 0 0 0 0 0 0 0
## 3979 0 0 0 0 0 0 0 0 0
## 2857 0 0 0 0 0 0 0 0 0
## 3831 0 0 0 0 0 0 0 0 0
## 3708 0 0 0 1 0 0 0 0 0
## 4426 0 0 0 0 0 0 0 0 0
## 4158 0 0 0 0 0 0 0 0 0
## 1528 0 0 0 0 0 0 0 0 0
## 1249 0 0 0 0 0 0 0 0 0
## 3575 0 0 0 0 0 0 0 0 0
## 3599 0 0 0 0 1 0 0 0 0
## 4419 0 0 1 0 0 0 0 0 0
## 3818 0 0 0 0 0 0 0 0 0
## 642 0 0 0 0 0 0 0 0 0
## 1385 0 0 0 0 0 0 0 0 0
## 937 0 0 1 0 0 0 0 0 0
## 3771 0 0 0 0 0 0 0 1 0
## 620 0 0 0 0 0 0 0 0 0
## 621 0 0 0 0 0 0 0 0 0
## 348 0 0 0 0 0 0 0 0 0
## 256 0 0 0 0 0 0 0 0 0
## 2556 1 0 0 0 0 0 0 0 0
## 540 0 0 0 0 0 0 0 1 0
## 3569 0 0 0 0 0 0 0 0 0
## 3512 0 0 0 0 0 0 0 0 0
## 4249 0 0 0 0 0 0 0 0 0
## 2482 0 0 0 0 0 0 0 0 0
## 4088 0 0 0 0 0 0 0 0 1
## 2125 0 0 0 0 0 0 0 0 0
## 758 0 0 0 0 0 0 0 0 0
## 2121 0 0 0 0 0 0 0 0 0
## 4640 0 0 0 0 0 0 0 0 0
## 2323 0 0 0 0 0 0 0 0 0
## 1210 0 0 0 0 0 0 0 0 0
## 1245 1 0 0 0 0 0 0 0 0
## 2597 0 0 0 0 0 0 0 0 0
## 3113 0 0 0 0 0 0 0 0 0
## 1611 0 0 0 0 0 0 0 0 0
## 292 0 0 1 0 0 0 0 0 0
## 2160 0 0 0 0 0 0 0 0 0
## 4014 0 0 0 0 0 0 0 0 0
## 2750 0 0 0 0 0 0 0 0 0
## 1691 0 0 0 0 0 0 0 0 0
## 4886 0 0 0 0 0 0 0 0 0
## 4269 0 0 0 0 0 0 0 0 0
## 2343 0 0 0 0 0 0 0 0 0
## 821 0 0 0 0 0 0 0 0 0
## 2595 0 0 0 0 0 0 0 0 1
## 4593 0 0 0 0 0 0 0 0 0
## 4911 0 0 0 0 0 0 1 0 0
## 3918 0 0 0 0 0 0 0 0 0
## 1466 0 0 0 0 0 0 0 0 0
## 886 0 0 0 0 0 0 0 0 0
## 231 0 0 0 0 0 0 0 0 0
## 1173 0 0 0 0 0 0 0 0 0
## 1675 0 0 0 0 0 0 1 0 0
## 759 0 0 0 0 0 0 0 0 0
## 1450 0 0 0 0 0 0 0 0 0
## 84 0 0 0 0 0 0 0 0 0
## [ remaining rows and columns of the churn_x print-out omitted for brevity:
##   250 observations of 70 dummy-coded and numeric predictors.
##   See str(churn_x) below for the full column listing. ]
str(churn_x)
## 'data.frame': 250 obs. of 70 variables:
## $ stateAK : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateAL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateAR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateAZ : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateCA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateCO : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateCT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateDC : int 0 0 0 0 0 0 0 0 0 1 ...
## $ stateDE : int 0 0 0 1 0 0 0 0 0 0 ...
## $ stateFL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateGA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateHI : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateIA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateID : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateIL : int 0 0 0 0 0 0 0 1 0 0 ...
## $ stateIN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateKS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateKY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateLA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateMA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateMD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateMI : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateMN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateMO : int 0 1 0 0 0 0 0 0 0 0 ...
## $ stateMS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateMT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateNC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateND : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateNE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateNH : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateNJ : int 0 0 0 0 0 0 0 0 1 0 ...
## $ stateNM : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateNV : int 0 0 0 0 0 0 1 0 0 0 ...
## $ stateNY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateOH : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateOK : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateOR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ statePA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateRI : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateSC : int 1 0 0 0 0 0 0 0 0 0 ...
## $ stateSD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateTN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateTX : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateUT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateVA : int 0 0 0 0 0 1 0 0 0 0 ...
## $ stateVT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateWA : int 0 0 0 0 1 0 0 0 0 0 ...
## $ stateWI : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stateWV : int 0 0 1 0 0 0 0 0 0 0 ...
## $ stateWY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ account_length : int 137 83 48 67 143 163 100 151 139 17 ...
## $ area_codearea_code_415 : int 0 1 1 1 0 1 1 0 1 0 ...
## $ area_codearea_code_510 : int 1 0 0 0 1 0 0 0 0 1 ...
## $ international_planyes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ voice_mail_planyes : int 0 0 1 0 0 0 1 0 1 1 ...
## $ number_vmail_messages : int 0 0 34 0 0 0 39 0 43 30 ...
## $ total_day_minutes : num 110 197 198 164 133 ...
## $ total_day_calls : int 112 117 70 79 107 100 74 106 85 101 ...
## $ total_day_charge : num 18.7 33.4 33.7 28 22.7 ...
## $ total_eve_minutes : num 224 272 274 110 224 ...
## $ total_eve_calls : int 88 89 121 108 117 46 80 87 82 85 ...
## $ total_eve_charge : num 19 23.12 23.26 9.38 19.03 ...
## $ total_night_minutes : num 248 200 218 204 180 ...
## $ total_night_calls : int 96 62 71 102 85 116 89 88 105 130 ...
## $ total_night_charge : num 11.14 9 9.81 9.18 8.12 ...
## $ total_intl_minutes : num 17.8 10.1 7.6 9.8 10.2 12.8 11.2 11.8 8.3 10.3 ...
## $ total_intl_calls : int 2 11 4 2 13 3 4 5 5 2 ...
## $ total_intl_charge : num 4.81 2.73 2.05 2.65 2.75 3.46 3.02 3.19 2.24 2.78 ...
## $ number_customer_service_calls: int 1 3 1 1 1 5 2 0 2 3 ...
Use trainControl() to create a reusable trainControl object for comparing models.
# Create reusable trainControl object: myControl
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds
)
Great work! By saving the fold indices in the train control object, we can fit many models using the same CV folds (a sketch of how such folds can be created is shown below).
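For reference, here is a minimal sketch of how a set of custom folds like myFolds could be created. It assumes churn_y (the factor of churn outcomes used below) is available in the workspace; the folds used earlier in the course may have been generated with different settings.
library(caret)
# Sketch only: trainControl()'s index argument expects, for each resample,
# the row numbers used for *training*, which returnTrain = TRUE provides.
set.seed(42)
myFolds <- createFolds(churn_y, k = 5, returnTrain = TRUE)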
5.2 Reintroducing glmnet
5.2.1 glmnet as a baseline model
What makes glmnet
a good baseline model?
-
It’s simple, fast, and easy to interpret.
-
It always gives poor predictions, so your other models will look good by comparison.
-
Linear models with penalties on their coefficients always give better results.
5.2.2 Fit the baseline model
Now that you have a reusable trainControl object called myControl, you can start fitting different predictive models to your churn dataset and evaluating their predictive accuracy.
You'll start with one of my favorite models, glmnet, which penalizes linear and logistic regression models on the size and number of coefficients to help prevent overfitting.
Fit a glmnet model to the churn dataset and save it as model_glmnet. Make sure to use myControl, which you created in the first exercise and which is available in your workspace, as the trainControl object.
# Fit glmnet model: model_glmnet
model_glmnet <- train(
  x = churn_x,
  y = churn_y,
  metric = "ROC",
  method = "glmnet",
  trControl = myControl
)
## + Fold1: alpha=0.10, lambda=0.01821
## - Fold1: alpha=0.10, lambda=0.01821
## + Fold1: alpha=0.55, lambda=0.01821
## - Fold1: alpha=0.55, lambda=0.01821
## + Fold1: alpha=1.00, lambda=0.01821
## - Fold1: alpha=1.00, lambda=0.01821
## + Fold2: alpha=0.10, lambda=0.01821
## - Fold2: alpha=0.10, lambda=0.01821
## + Fold2: alpha=0.55, lambda=0.01821
## - Fold2: alpha=0.55, lambda=0.01821
## + Fold2: alpha=1.00, lambda=0.01821
## - Fold2: alpha=1.00, lambda=0.01821
## + Fold3: alpha=0.10, lambda=0.01821
## - Fold3: alpha=0.10, lambda=0.01821
## + Fold3: alpha=0.55, lambda=0.01821
## - Fold3: alpha=0.55, lambda=0.01821
## + Fold3: alpha=1.00, lambda=0.01821
## - Fold3: alpha=1.00, lambda=0.01821
## + Fold4: alpha=0.10, lambda=0.01821
## - Fold4: alpha=0.10, lambda=0.01821
## + Fold4: alpha=0.55, lambda=0.01821
## - Fold4: alpha=0.55, lambda=0.01821
## + Fold4: alpha=1.00, lambda=0.01821
## - Fold4: alpha=1.00, lambda=0.01821
## + Fold5: alpha=0.10, lambda=0.01821
## - Fold5: alpha=0.10, lambda=0.01821
## + Fold5: alpha=0.55, lambda=0.01821
## - Fold5: alpha=0.55, lambda=0.01821
## + Fold5: alpha=1.00, lambda=0.01821
## - Fold5: alpha=1.00, lambda=0.01821
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.55, lambda = 0.0182 on full training set
Great work! This model uses our custom CV folds and will be easy to compare to other models.
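If you want more control over the values of alpha and lambda that glmnet searches, you can also supply a custom tuneGrid. The grid below is purely illustrative (these are not the course's values) and reuses the same myControl folds so the result stays comparable.
# Illustrative tuning grid for glmnet (values chosen for demonstration only)
glmnet_grid <- expand.grid(
  alpha = c(0, 0.5, 1),                       # 0 = ridge, 1 = lasso
  lambda = seq(0.0001, 0.1, length.out = 10)  # penalty strength
)

model_glmnet_tuned <- train(
  x = churn_x,
  y = churn_y,
  metric = "ROC",
  method = "glmnet",
  tuneGrid = glmnet_grid,
  trControl = myControl
)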
5.3 Reintroducing random forest
5.3.1 Random forest drawback
What’s the drawback of using a random forest model for churn prediction?
-
Tree-based models are usually less accurate than linear models.
-
You no longer have model coefficients to help interpret the model.
-
Nobody else uses random forests to predict churn.
5.3.2 Random forest with custom trainControl
Another one of my favorite models is the random forest, which combines an ensemble of non-linear decision trees into a highly flexible (and usually quite accurate) model.
Rather than using the classic randomForest
package, you’ll be using the ranger
package, which is a re-implementation of randomForest
that produces almost the exact same results, but is faster, more
stable, and uses less memory. I highly recommend it as a starting point
for random forest modeling in R.
churn_x and churn_y are loaded in your workspace.
Use myControl as the trainControl object, as you've done before, and specify the "ranger" method.
# Fit random forest: model_rf
model_rf <- train(
  x = churn_x,
  y = churn_y,
  metric = "ROC",
  method = "ranger",
  trControl = myControl
)
## + Fold1: mtry= 2, min.node.size=1, splitrule=gini
## - Fold1: mtry= 2, min.node.size=1, splitrule=gini
## + Fold1: mtry=36, min.node.size=1, splitrule=gini
## - Fold1: mtry=36, min.node.size=1, splitrule=gini
## + Fold1: mtry=70, min.node.size=1, splitrule=gini
## - Fold1: mtry=70, min.node.size=1, splitrule=gini
## + Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold1: mtry=36, min.node.size=1, splitrule=extratrees
## - Fold1: mtry=36, min.node.size=1, splitrule=extratrees
## + Fold1: mtry=70, min.node.size=1, splitrule=extratrees
## - Fold1: mtry=70, min.node.size=1, splitrule=extratrees
## + Fold2: mtry= 2, min.node.size=1, splitrule=gini
## - Fold2: mtry= 2, min.node.size=1, splitrule=gini
## + Fold2: mtry=36, min.node.size=1, splitrule=gini
## - Fold2: mtry=36, min.node.size=1, splitrule=gini
## + Fold2: mtry=70, min.node.size=1, splitrule=gini
## - Fold2: mtry=70, min.node.size=1, splitrule=gini
## + Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold2: mtry=36, min.node.size=1, splitrule=extratrees
## - Fold2: mtry=36, min.node.size=1, splitrule=extratrees
## + Fold2: mtry=70, min.node.size=1, splitrule=extratrees
## - Fold2: mtry=70, min.node.size=1, splitrule=extratrees
## + Fold3: mtry= 2, min.node.size=1, splitrule=gini
## - Fold3: mtry= 2, min.node.size=1, splitrule=gini
## + Fold3: mtry=36, min.node.size=1, splitrule=gini
## - Fold3: mtry=36, min.node.size=1, splitrule=gini
## + Fold3: mtry=70, min.node.size=1, splitrule=gini
## - Fold3: mtry=70, min.node.size=1, splitrule=gini
## + Fold3: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold3: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold3: mtry=36, min.node.size=1, splitrule=extratrees
## - Fold3: mtry=36, min.node.size=1, splitrule=extratrees
## + Fold3: mtry=70, min.node.size=1, splitrule=extratrees
## - Fold3: mtry=70, min.node.size=1, splitrule=extratrees
## + Fold4: mtry= 2, min.node.size=1, splitrule=gini
## - Fold4: mtry= 2, min.node.size=1, splitrule=gini
## + Fold4: mtry=36, min.node.size=1, splitrule=gini
## - Fold4: mtry=36, min.node.size=1, splitrule=gini
## + Fold4: mtry=70, min.node.size=1, splitrule=gini
## - Fold4: mtry=70, min.node.size=1, splitrule=gini
## + Fold4: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold4: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold4: mtry=36, min.node.size=1, splitrule=extratrees
## - Fold4: mtry=36, min.node.size=1, splitrule=extratrees
## + Fold4: mtry=70, min.node.size=1, splitrule=extratrees
## - Fold4: mtry=70, min.node.size=1, splitrule=extratrees
## + Fold5: mtry= 2, min.node.size=1, splitrule=gini
## - Fold5: mtry= 2, min.node.size=1, splitrule=gini
## + Fold5: mtry=36, min.node.size=1, splitrule=gini
## - Fold5: mtry=36, min.node.size=1, splitrule=gini
## + Fold5: mtry=70, min.node.size=1, splitrule=gini
## - Fold5: mtry=70, min.node.size=1, splitrule=gini
## + Fold5: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold5: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold5: mtry=36, min.node.size=1, splitrule=extratrees
## - Fold5: mtry=36, min.node.size=1, splitrule=extratrees
## + Fold5: mtry=70, min.node.size=1, splitrule=extratrees
## - Fold5: mtry=70, min.node.size=1, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 36, splitrule = extratrees, min.node.size = 1 on full training set
Great work! This random forest uses the custom CV folds, so we can easily compare it to the baseline model.
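The training log above shows the three tuning parameters caret exposes for ranger: mtry, splitrule, and min.node.size. If the default grid isn't enough, you could pass your own; the grid below is only a sketch with illustrative values, again reusing myControl.
# Illustrative grid over ranger's tuning parameters (not the course's values)
rf_grid <- expand.grid(
  mtry = c(2, 10, 35, 70),
  splitrule = c("gini", "extratrees"),
  min.node.size = 1
)

model_rf_tuned <- train(
  x = churn_x,
  y = churn_y,
  metric = "ROC",
  method = "ranger",
  tuneGrid = rf_grid,
  trControl = myControl
)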
5.4 Comparing models
5.4.1 Matching train/test indices
What’s the primary reason that train/test indices need to match when comparing two models?
-
You can save a lot of time when fitting your models because you don’t have to remake the datasets.
-
There’s no real reason; it just makes your plots look better.
-
Because otherwise you wouldn’t be doing a fair comparison of your models and your results could be due to chance.
5.4.2 Create a resamples object
Now that you have fit two models to the churn dataset, it’s time to compare their out-of-sample predictions and choose which one is the best model for your dataset.
You can compare models in caret
using the resamples()
function, provided they have the same training data and use the same trainControl
object with preset cross-validation folds. resamples()
takes as input a list of models and can be used to compare dozens of
models at once (though in this case you are only comparing two models).
model_glmnet and model_rf are loaded in your workspace.
Create a list() containing the glmnet model as item1 and the ranger model as item2.
# Create model_list
model_list <- list(item1 = model_glmnet, item2 = model_rf)
Pass model_list to the resamples() function and save the resulting object as resamples.
# Pass model_list to resamples(): resamples
resamples <- resamples(model_list)
Call summary() on resamples.
# Summarize the results
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: item1, item2
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## item1 0.4489390 0.4832007 0.5296198 0.5440319 0.6178286 0.6405714 0
## item2 0.6621353 0.7017020 0.7075429 0.7000982 0.7100571 0.7190539 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## item1 0.9195402 0.9482759 0.9542857 0.9552644 0.9657143 0.9885057 0
## item2 0.9712644 0.9885714 0.9942857 0.9908243 1.0000000 1.0000000 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## item1 0.03846154 0.03846154 0.07692308 0.09476923 0.08 0.24 0
## item2 0.00000000 0.00000000 0.00000000 0.03200000 0.00 0.16 0
Amazing! The resamples function gives us a bunch of options for comparing models, which we'll explore further in the next exercises.
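If you prefer to work with the numbers directly rather than the printed summary, the per-fold metrics are stored in a data frame inside the resamples object. The column names below assume the item1/item2 labels used above.
# Per-fold metrics as a data frame (columns follow the "model~metric" pattern)
head(resamples$values)

# Mean out-of-sample ROC for each model
colMeans(resamples$values[, c("item1~ROC", "item2~ROC")])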
5.5 More on resamples
5.5.1 Create a box-and-whisker plot
caret
provides a variety of methods to use for comparing models. All of these methods are based on the resamples()
function. My favorite is the box-and-whisker plot, which allows you to
compare the distribution of predictive accuracy (in this case AUC) for
the two models.
In general, you want the model with the higher median AUC, as well as a smaller range between min and max AUC.
You can make this plot using the bwplot() function, which makes a box-and-whisker plot of the models' out-of-sample scores. Box-and-whisker plots show the median of each distribution as a line and the interquartile range of each distribution as a box around the median line. You can pass the metric = "ROC" argument to the bwplot() function to plot the models' out-of-sample ROC scores and choose the model with the highest median ROC.
If you do not specify a metric to plot, bwplot() will automatically plot three of them.
Pass the resamples
object to the bwplot()
function to make a box-and-whisker plot. Look at the resulting plot and
note which model has the higher median ROC statistic. Be sure to specify
which metric you want to plot.
# Create bwplot
bwplot(resamples, metric = "ROC")
Great work! I’m a big fan of box and whisker plots for comparing models.
5.5.2 Create a scatterplot
Another useful plot for comparing models is the scatterplot, also known as the xy-plot. This plot shows you how similar the two models’ performances are on different folds.
It’s particularly useful for identifying if one model is consistently better than the other across all folds, or if there are situations when the inferior model produces better predictions on a particular subset of the data.
Pass the resamples
object to the xyplot()
function. Look at the resulting plot and note how similar the two
models’ predictions are (or are not) on the different folds. Be sure to
specify which metric you want to plot.
# Create xyplot
xyplot(resamples, metric = "ROC")
Nice one! These scatterplots let you see if one model is always better than the other.
5.5.3 Ensembling models
That concludes the course! As a teaser for a future course on making ensembles of caret
models, I’ll show you how to fit a stacked ensemble of models using the caretEnsemble
package.
caretEnsemble
provides the caretList()
function for creating multiple caret
models at once on the same dataset, using the same resampling folds. You can also create your own lists of caret
models.
In this exercise, I’ve made a caretList
for you, containing the glmnet
and ranger
models you fit on the churn dataset. Use the caretStack()
function to make a stack of caret
models, with the two sub-models (glmnet
and ranger
) feeding into another (hopefully more accurate!) caret
model.
Call the caretStack() function with two arguments, model_list and method = "glm", to ensemble the two models using a logistic regression. Store the result as stack.
library(caretEnsemble)
model_list <- c(item1 = model_glmnet, item2 = model_rf)

# Create ensemble model: stack
stack <- caretStack(model_list, method = "glm")
Inspect the resulting ensemble with the summary() function.
# Look at summary
summary(stack)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4886 -0.4979 -0.4298 -0.4028 2.3499
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0415 0.6178 3.304 0.000952 ***
## item1.glmnet1 3.3744 0.8527 3.957 7.58e-05 ***
## item2.ranger2 -7.9001 1.2252 -6.448 1.13e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 765.13 on 999 degrees of freedom
## Residual deviance: 719.59 on 997 degrees of freedom
## AIC: 725.59
##
## Number of Fisher Scoring iterations: 5
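Once the stack is fit, it can be used for prediction like any other caret model. The call below is only a sketch: it reuses churn_x as stand-in data (in practice you would predict on a held-out test set), and the exact return format of type = "prob" depends on your caretEnsemble version.
# Predict churn probabilities with the stacked ensemble (illustrative only)
stack_preds <- predict(stack, newdata = churn_x, type = "prob")
head(stack_preds)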
Great work! The caretEnsemble
package gives you an easy way to combine many caret models. Now for a brief farewell message from Max…