Classifying purchasing behavior

Let's fit some GAMs to the csale data, which has the binary outcome variable purchase.

* Fit a logistic GAM predicting whether a purchase will be made based only on a smooth of the mortgage_age variable.

> str(csale)
Classes 'tbl_df', 'tbl' and 'data.frame':  1779 obs. of  8 variables:
 $ purchase         : num  0 0 0 0 0 0 0 0 1 0 ...
 $ n_acts           : num  11 0 6 8 8 1 5 0 9 18 ...
 $ bal_crdt_ratio   : num  0 36.1 17.6 12.5 59.1 ...
 $ avg_prem_balance : num  2494 2494 2494 2494 2494 ...
 $ retail_crdt_ratio: num  0 11.5 0 0.8 20.8 ...
 $ avg_fin_balance  : num  1767 1767 0 1021 797 ...
 $ mortgage_age     : num  182 139 139 139 93 ...
 $ cred_limit       : num  12500 0 0 0 0 0 0 0 11500 16000 ...

# Load mgcv, which provides gam()
library(mgcv)

# Examine the csale data frame
head(csale)
str(csale)

# Fit a logistic model
log_mod <- gam(purchase ~ s(mortgage_age), data = csale, family = binomial, method = "REML")

summary(log_mod)

Family: binomial
Link function: logit

Formula:
purchase ~ s(mortgage_age)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.34131    0.05908   -22.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                  edf Ref.df Chi.sq  p-value
s(mortgage_age) 5.226  6.317  30.02 4.11e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.0182   Deviance explained =  1.9%
-REML = 910.49  Scale est. = 1         n = 1779

> plogis(-1.34131)
[1] 0.2072947

What probability of purchase does log_mod estimate for a person who has mean values of all variables?
--> About 21%. This is the probability when all smooths are at zero and the only effect is the intercept, converted from log-odds with plogis().

-----------------------------------------------------------------------------------------------------------
Purchase behavior with multiple smooths

In this exercise, you will fit a logistic GAM that predicts the outcome (purchase) in csale, using all other variables as predictors.

* Fit a logistic GAM on all variables.
* Print the summary of the model.

# Fit a logistic model with a smooth of every predictor
# (purchase ~ . would fit linear terms only, which does not match the formula in the summary below)
log_mod2 <- gam(purchase ~ s(n_acts) + s(bal_crdt_ratio) + s(avg_prem_balance) +
                  s(retail_crdt_ratio) + s(avg_fin_balance) + s(mortgage_age) +
                  s(cred_limit),
                data = csale, family = binomial, method = "REML")

# View the summary
summary(log_mod2)

Family: binomial
Link function: logit

Formula:
purchase ~ s(n_acts) + s(bal_crdt_ratio) + s(avg_prem_balance) +
    s(retail_crdt_ratio) + s(avg_fin_balance) + s(mortgage_age) +
    s(cred_limit)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.64060    0.07557  -21.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                        edf Ref.df Chi.sq  p-value
s(n_acts)             3.474  4.310 93.670  < 2e-16 ***
s(bal_crdt_ratio)     4.308  5.257 18.386  0.00319 **
s(avg_prem_balance)   2.275  2.816  7.800  0.04889 *
s(retail_crdt_ratio)  1.001  1.001  1.422  0.23343
s(avg_fin_balance)    1.850  2.202  2.506  0.27889
s(mortgage_age)       4.669  5.710  9.656  0.13356
s(cred_limit)         1.001  1.002 23.066 2.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.184   Deviance explained = 18.4%
-REML = 781.37  Scale est. = 1         n = 1779

Which term in the model is significant but approximately linear?
--> s(cred_limit). Its edf is near one, so the term is approximately linear, and its p-value is below 0.05. The edf and p-value are read the same way for logistic GAMs as for other GAMs.

-----------------------------------------------------------------------------------------------------------
Visualizing influences on purchase probability

Let's try plotting and interpreting the purchasing behavior model from the last lesson. You'll step through several iterations of plots of log_mod2, moving from raw plots on the fitting scale towards plots on the response scale with more natural interpretations.

The model (log_mod2) from the previous exercise is available in your workspace.

* Plot all partial effects of log_mod2 on the log-odds scale. Put all plots on a single page.
* Convert the plots to the probability scale using the trans argument.
* Convert the plots to probability centered on the intercept with the shift argument.
* Add intercept-related uncertainty to the plots using the seWithMean argument.

Each transformation brings us to a more interpretable output.

# Plot on the log-odds scale
plot(log_mod2, pages = 1)

# Plot on the probability scale
plot(log_mod2, pages = 1, trans = plogis)

# Plot with the intercept
plot(log_mod2, pages = 1, trans = plogis, shift = coef(log_mod2)[1])

# Plot with intercept uncertainty
plot(log_mod2, pages = 1, trans = plogis, shift = coef(log_mod2)[1], seWithMean = TRUE)

For the next few questions, you'll inspect the partial effects plots of some of the terms in log_mod2.

Q1: All else being equal, which of these variables has the largest effect on purchase probability?
A1: n_acts has the largest effect on purchase probability.

Q2: What is the expected purchase probability of a person with 20 accounts (n_acts = 20) if all other values are average?
A2: When n_acts is 20, the predicted probability of purchase is about 0.55, all else being equal.

Q3: Which of these predictions has the greatest uncertainty, assuming all other variables are at average levels?
A3:
    The probability of purchase when avg_fin_balance is 2000.
    The probability of purchase when mortgage_age is 50.
    The probability of purchase when avg_fin_balance is 5000. <--
    The probability of purchase when mortgage_age is 150.
There are wide confidence intervals around the avg_fin_balance smooth at a value of 5000.

-----------------------------------------------------------------------------------------------------------
Predicting purchase behavior and uncertainty

The log_mod2 purchase behavior model lets you make predictions from credit data. In this exercise, you'll use a new set of data, new_credit_data, and calculate predicted outcomes and confidence bounds.

The model (log_mod2) from exercise 3 is available in your workspace.
* Use the model log_mod2 to calculate the predicted purchase log-odds, and standard errors for those predictions, for the observations in new_credit_data.
* Using your predictions and standard errors, calculate high and low confidence bounds for the log-odds of purchase for each observation.
* Calculate the high confidence bound as two standard errors above the mean prediction, and the low confidence bound as two standard errors below the mean prediction.
* Convert your calculated confidence bounds from the log-odds scale to the probability scale.

> new_credit_data
  purchase n_acts bal_crdt_ratio avg_prem_balance retail_crdt_ratio
1        1      2        0.30000           61.000          11.49123
2        0     19        4.20000          967.000           0.00000
3        0      0       36.09506         2494.414          11.49123
4        0      0       36.09506         2494.414          11.49123
5        0      1       25.70000         2494.414          11.49123
6        0      6       45.60000          195.000           0.00000
7        0      3       10.80000         2494.414          11.49123
  avg_fin_balance mortgage_age cred_limit
1        1767.197     155.0000          0
2         249.000      65.0000      10000
3        1767.197     138.9601          0
4        1767.197     138.9601          0
5        1767.197     138.9601          0
6           0.000      13.0000      13800
7        1767.197     138.9601          0

# Calculate predictions and errors on the log-odds (link) scale
predictions <- predict(log_mod2, newdata = new_credit_data, type = "link", se.fit = TRUE)

# Inspect the predictions
predictions

$fit
          1           2           3           4           5           6
-1.20182840  0.07182382 -3.03331719 -3.03331719 -2.43727488 -1.89483336
          7
-1.35595902

$se.fit
        1         2         3         4         5         6         7
0.2614862 0.3608447 0.2238734 0.2238734 0.1831934 0.3806117 0.2019073

# Calculate high and low confidence bounds on the log-odds scale
high_pred <- predictions$fit + 2 * predictions$se.fit
low_pred <- predictions$fit - 2 * predictions$se.fit

# Convert bounds to the probability scale
high_prob <- plogis(high_pred)
low_prob <- plogis(low_pred)

# Inspect
high_prob
low_prob

> high_prob
         1          2          3          4          5          6          7
0.33651666 0.68858519 0.07007288 0.07007288 0.11195872 0.24349553 0.27845378
> low_prob
         1          2          3          4          5          6          7
0.15125383 0.34301982 0.02985584 0.02985584 0.05712662 0.06561668 0.14681869
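
The bound calculation above can be wrapped in a small helper. What follows is a minimal sketch, not part of the original exercise: the function name link_to_prob_bounds and its mult argument are hypothetical, and the inputs are simply the $fit and $se.fit values printed above. It builds the interval on the log-odds scale and only then applies plogis(), so every bound stays strictly between 0 and 1.

```r
# Hypothetical helper (not from the course): build the interval on the
# link (log-odds) scale, then back-transform with the inverse logit.
link_to_prob_bounds <- function(fit, se, mult = 2) {
  data.frame(
    low      = plogis(fit - mult * se),
    estimate = plogis(fit),
    high     = plogis(fit + mult * se)
  )
}

# Link-scale predictions and standard errors copied from the output above
fit <- c(-1.20182840, 0.07182382, -3.03331719, -3.03331719,
         -2.43727488, -1.89483336, -1.35595902)
se <- c(0.2614862, 0.3608447, 0.2238734, 0.2238734,
        0.1831934, 0.3806117, 0.2019073)

bounds <- link_to_prob_bounds(fit, se)
print(round(bounds, 3))
```

Because the interval is symmetric on the log-odds scale, it is generally asymmetric around the estimate once converted to probabilities.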
-----------------------------------------------------------------------------------------------------------
Explaining individual behaviors

In the final exercise of this chapter, you will use the model log_mod2 to predict the contribution of each term to the prediction of one row in new_credit_data.

* Predict the effect of each model term on the output for the first row of new_credit_data.

# Predict per-term contributions (log-odds scale) and keep the first row
prediction_1 <- predict(log_mod2, newdata = new_credit_data, type = "terms")[1, ]

# Inspect
prediction_1

           s(n_acts)    s(bal_crdt_ratio)  s(avg_prem_balance)
        -0.420012186          0.455136892          0.238379001
s(retail_crdt_ratio)   s(avg_fin_balance)      s(mortgage_age)
         0.001031828          0.016478436          0.025013275
       s(cred_limit)
         0.380455757

Which term makes the greatest contribution to the prediction of this first data point?
For this data point, s(bal_crdt_ratio) has the greatest contribution to the prediction (the largest value in absolute terms). Note this is despite the fact that s(n_acts) has a larger overall effect, as we saw in the last section.
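
Per-term contributions are additive on the log-odds scale: for a model without an offset, the intercept plus the sum of one row's terms should reproduce that row's type = "link" prediction. The sketch below checks this on simulated data, since csale is not reproduced here. The data frame sim, its columns x1 and x2, and the model sim_mod are all made up for illustration and are not part of the exercise.

```r
library(mgcv)

# Simulated stand-in for csale (hypothetical data, illustration only)
set.seed(1)
n <- 500
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- rbinom(n, 1, plogis(-1 + sin(2 * pi * sim$x1) + sim$x2))

sim_mod <- gam(y ~ s(x1) + s(x2), data = sim,
               family = binomial, method = "REML")

# Two new observations; we inspect the first one
new_obs <- data.frame(x1 = c(0.25, 0.60), x2 = c(0.80, 0.10))

term_mat <- predict(sim_mod, newdata = new_obs, type = "terms")
link <- predict(sim_mod, newdata = new_obs, type = "link")

# Intercept plus the row's term contributions gives the log-odds prediction
rebuilt <- coef(sim_mod)[["(Intercept)"]] + sum(term_mat[1, ])
print(all.equal(rebuilt, as.numeric(link[1])))

# Term with the largest absolute contribution for the first observation
top_term <- names(which.max(abs(term_mat[1, ])))
print(top_term)
```

Comparing absolute values matters here because a large negative contribution moves the prediction as much as a large positive one.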