Classifying purchasing behavior

Let's fit some GAMs to the csale data, which has a binary outcome variable, purchase.

* Fit a logistic GAM predicting whether a purchase will be made based only on a smooth of mortgage_age.

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1779 obs. of  8 variables:
 $ purchase         : num  0 0 0 0 0 0 0 0 1 0 ...
 $ n_acts           : num  11 0 6 8 8 1 5 0 9 18 ...
 $ bal_crdt_ratio   : num  0 36.1 17.6 12.5 59.1 ...
 $ avg_prem_balance : num  2494 2494 2494 2494 2494 ...
 $ retail_crdt_ratio: num  0 11.5 0 0.8 20.8 ...
 $ avg_fin_balance  : num  1767 1767 0 1021 797 ...
 $ mortgage_age     : num  182 139 139 139 93 ...
 $ cred_limit       : num  12500 0 0 0 0 0 0 0 11500 16000 ...

# Examine the csale data frame
str(csale)

# Fit a logistic model (gam() comes from the mgcv package)
library(mgcv)
log_mod <- gam(purchase ~ s(mortgage_age), data = csale,
               family = binomial,
               method = "REML")

# Print the summary
summary(log_mod)


Family: binomial 
Link function: logit 

purchase ~ s(mortgage_age)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.34131    0.05908   -22.7   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                  edf Ref.df Chi.sq  p-value    
s(mortgage_age) 5.226  6.317  30.02 4.11e-05 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0182   Deviance explained =  1.9%
-REML = 910.49  Scale est. = 1         n = 1779

[1] 0.2072947

What probability of purchase does log_mod estimate for a person with mean values of all variables?
--> About 21%. This is the probability when all smooths are at zero and the only effect is the 
intercept: the inverse logit of -1.34 is roughly 0.21.
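
The 21% figure can be reproduced by passing the intercept through the inverse logit. This is a minimal base-R sketch that assumes the rounded intercept printed in the summary above; the variable names are illustrative only.

```r
# Intercept of log_mod on the log-odds scale, copied from the summary output
intercept <- -1.34131

# plogis() is the inverse logit: 1 / (1 + exp(-x))
baseline_prob <- plogis(intercept)
baseline_prob

# The same value computed by hand
1 / (1 + exp(-intercept))
```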


Purchase behavior with multiple smooths

In this exercise, you will fit a logistic GAM that predicts the outcome (purchase) in csale, using all
other variables as predictors.

* Fit a logistic GAM on all variables.
* Print the summary of the model.

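# Fit a logistic model with a smooth of every predictor
log_mod2 <- gam(purchase ~ s(n_acts) + s(bal_crdt_ratio) +
                  s(avg_prem_balance) + s(retail_crdt_ratio) +
                  s(avg_fin_balance) + s(mortgage_age) +
                  s(cred_limit),
                data = csale,
                family = binomial,
                method = "REML")

# View the summary
summary(log_mod2)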

Family: binomial 
Link function: logit 

purchase ~ s(n_acts) + s(bal_crdt_ratio) + s(avg_prem_balance) + 
    s(retail_crdt_ratio) + s(avg_fin_balance) + s(mortgage_age) + 
    s(cred_limit)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.64060    0.07557  -21.71   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                       edf Ref.df Chi.sq  p-value    
s(n_acts)            3.474  4.310 93.670  < 2e-16 ***
s(bal_crdt_ratio)    4.308  5.257 18.386  0.00319 ** 
s(avg_prem_balance)  2.275  2.816  7.800  0.04889 *  
s(retail_crdt_ratio) 1.001  1.001  1.422  0.23343    
s(avg_fin_balance)   1.850  2.202  2.506  0.27889    
s(mortgage_age)      4.669  5.710  9.656  0.13356    
s(cred_limit)        1.001  1.002 23.066 2.37e-06 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.184   Deviance explained = 18.4%
-REML = 781.37  Scale est. = 1         n = 1779

Which term in the model is significant but approximately linear?
--> s(cred_limit)
Its edf is close to one, so the fitted smooth is approximately linear, and its p-value is below 0.05. 
The same concepts (edf and p-values) apply in the case of logistic GAMs. 


Visualizing influences on purchase probability

Let's try plotting and interpreting the purchasing behavior model from the last lesson. You'll step 
through several iterations of plots of log_mod2, moving from raw plots on the link (log-odds) scale 
towards plots on the response (probability) scale with more natural interpretations.

The model (log_mod2) from the previous exercise is available in your workspace.

* Plot all partial effects of log_mod2 on the log-odds scale. Put all plots on a single page.
* Convert the plots to the probability scale using the trans argument.
* Convert the plots to probability centered on the intercept with the shift argument.
* Add intercept-related uncertainty to the plots using the seWithMean argument.

Each transformation brings us to a more interpretable output.
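
To see why the shift matters, here is a small base-R sketch that mimics what trans and shift do to a partial effect: the shift is added on the log-odds scale first, and the transformation is applied afterwards. The effect values are made up for illustration; only the intercept is taken from the log_mod2 summary.

```r
# Hypothetical partial-effect values on the log-odds scale (illustration only)
effect <- c(-1, 0, 1)

# Intercept of log_mod2, copied from its summary output
intercept <- -1.64060

# trans = plogis alone: a zero effect maps to a probability of 0.5
prob_unshifted <- plogis(effect)

# trans = plogis with shift = intercept: a zero effect maps to the
# baseline purchase probability implied by the intercept
prob_shifted <- plogis(effect + intercept)

round(rbind(prob_unshifted, prob_shifted), 3)
```

Without the shift, every panel is centered on 0.5, which has no meaning for this model. With the shift, each panel is centered on the baseline probability of about 0.16.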

# Plot on the log-odds scale
plot(log_mod2, pages = 1)

# Plot on the probability scale
plot(log_mod2, pages = 1, trans = plogis)

# Plot with the intercept
plot(log_mod2, pages = 1, trans = plogis, 
     shift = coef(log_mod2)[1])

# Plot with intercept uncertainty
plot(log_mod2, pages = 1, trans = plogis, 
     shift = coef(log_mod2)[1], seWithMean = TRUE)

For the next few questions, you'll inspect the partial effects plots of some of the terms in log_mod2.

Q1: All else being equal, which of these variables has the largest effect on purchase probability?
A1: n_acts has the largest effect on purchase probability.

Q2: What is the expected purchase probability of a person with 20 accounts (n_acts = 20) if all other
    values are average?
A2: When n_acts is 20 the predicted probability of purchase is about 0.55, all else being equal.

Q3: Which of these predictions has the greatest uncertainty, assuming all other variables are at 
    average levels?
A3: The probability of purchase when avg_fin_balance is 2000.
    The probability of purchase when mortgage_age is 50.
    The probability of purchase when avg_fin_balance is 5000. <--
    The probability of purchase when mortgage_age is 150.
There are wide confidence intervals around the avg_fin_balance smooth at a value of 5000.


Predicting purchase behavior and uncertainty

The log_mod2 purchase behavior model lets you make predictions from credit data. In this exercise, you'll
use a new set of data, new_credit_data, and calculate predicted outcomes and confidence bounds.

The model (log_mod2) from the previous exercises is available in your workspace.

* Use the model log_mod2 to calculate the predicted purchase log-odds, and standard errors for those
   predictions, for the observations in new_credit_data.
* Using your predictions and standard errors, calculate high and low confidence bounds for the log-odds
  of purchase for each observation.
* Calculate the high confidence bound as two standard errors above the mean prediction, and the low
  confidence bound as two standard errors below the mean prediction. 
* Convert your calculated confidence bounds from the log-odds scale to the probability scale.

  purchase n_acts bal_crdt_ratio avg_prem_balance retail_crdt_ratio
1        1      2        0.30000           61.000          11.49123
2        0     19        4.20000          967.000           0.00000
3        0      0       36.09506         2494.414          11.49123
4        0      0       36.09506         2494.414          11.49123
5        0      1       25.70000         2494.414          11.49123
6        0      6       45.60000          195.000           0.00000
7        0      3       10.80000         2494.414          11.49123
  avg_fin_balance mortgage_age cred_limit
1        1767.197     155.0000          0
2         249.000      65.0000      10000
3        1767.197     138.9601          0
4        1767.197     138.9601          0
5        1767.197     138.9601          0
6           0.000      13.0000      13800
7        1767.197     138.9601          0

# Calculate predictions and errors
predictions <- predict(log_mod2, newdata = new_credit_data, 
                       type = "link", se.fit = TRUE)

# Inspect the predictions
predictions

$fit
          1           2           3           4           5           6 
-1.20182840  0.07182382 -3.03331719 -3.03331719 -2.43727488 -1.89483336 

$se.fit
        1         2         3         4         5         6         7 
0.2614862 0.3608447 0.2238734 0.2238734 0.1831934 0.3806117 0.2019073 

# Calculate high and low confidence bounds on the log-odds scale
high_pred <- predictions$fit + 2 * predictions$se.fit
low_pred <- predictions$fit - 2 * predictions$se.fit

# Convert intervals to probability scale
high_prob <- plogis(high_pred)
low_prob <- plogis(low_pred)

# Inspect
high_prob
low_prob

         1          2          3          4          5          6          7 
0.33651666 0.68858519 0.07007288 0.07007288 0.11195872 0.24349553 0.27845378 
         1          2          3          4          5          6          7 
0.15125383 0.34301982 0.02985584 0.02985584 0.05712662 0.06561668 0.14681869 
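
As a check on the steps above, the bounds can be recomputed by hand from the printed log-odds and standard errors. This base-R sketch hard-codes the first six values from the output (the seventh fitted value is not shown above, so it is left out); the names fit, se and bounds are illustrative.

```r
# Predicted log-odds and standard errors for observations 1-6,
# copied from the printed predictions
fit <- c(-1.20182840, 0.07182382, -3.03331719,
         -3.03331719, -2.43727488, -1.89483336)
se  <- c(0.2614862, 0.3608447, 0.2238734,
         0.2238734, 0.1831934, 0.3806117)

# Build the interval on the log-odds scale, then transform each bound
bounds <- data.frame(
  low  = plogis(fit - 2 * se),
  mid  = plogis(fit),
  high = plogis(fit + 2 * se)
)
round(bounds, 3)
```

Because the interval is built on the link scale and transformed afterwards, the bounds always stay between 0 and 1 and are generally not symmetric around the point estimate.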


Explaining individual behaviors

In the final exercise of this chapter, you will use the model log_mod2 to predict the contribution of 
each term to the prediction of one row in new_credit_data.

* Predict the effect of each model term on the output for the first row of new_credit_data.

# Predict from the model (the type for term-wise predictions is "terms")
prediction_1 <- predict(log_mod2, 
                        newdata = new_credit_data, 
                        type = "terms")[1,]

# Inspect
prediction_1

           s(n_acts)    s(bal_crdt_ratio)  s(avg_prem_balance) 
        -0.420012186          0.455136892          0.238379001 
s(retail_crdt_ratio)   s(avg_fin_balance)      s(mortgage_age) 
         0.001031828          0.016478436          0.025013275 

Which term makes the greatest contribution to the prediction of this first data point?
For this data point, s(bal_crdt_ratio) has the greatest contribution to the prediction. 
Note this is despite the fact that s(n_acts) has a larger overall effect, as we saw in the last section.
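
Term-wise predictions are additive on the log-odds scale: for a model like this one, with only an intercept and smooths, the intercept plus the sum of all term contributions equals the type = "link" prediction. The base-R sketch below uses the numbers printed above. The printed output shows six of the seven terms, so the last line only reports what the remaining contribution would have to be under that identity; it is an implied value, not model output.

```r
# Term contributions for row 1, copied from the printed output
terms_shown <- c(
  "s(n_acts)"            = -0.420012186,
  "s(bal_crdt_ratio)"    =  0.455136892,
  "s(avg_prem_balance)"  =  0.238379001,
  "s(retail_crdt_ratio)" =  0.001031828,
  "s(avg_fin_balance)"   =  0.016478436,
  "s(mortgage_age)"      =  0.025013275
)

# Term with the largest contribution in absolute value
largest <- names(which.max(abs(terms_shown)))
largest

# Intercept of log_mod2 and the link-scale prediction for row 1,
# both copied from earlier output
intercept <- -1.64060
link_row1 <- -1.20182840

# Contribution not shown above, implied by: link = intercept + sum(terms)
implied_rest <- link_row1 - intercept - sum(terms_shown)
implied_rest
```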