Classifying purchasing behavior

Let's fit some GAMs to the csale data, whose outcome variable, purchase, is binary.

* Fit a logistic GAM predicting whether a purchase will be made based only on a smooth of the mortgage_age
  variable.


> str(csale)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1779 obs. of  8 variables:
 $ purchase         : num  0 0 0 0 0 0 0 0 1 0 ...
 $ n_acts           : num  11 0 6 8 8 1 5 0 9 18 ...
 $ bal_crdt_ratio   : num  0 36.1 17.6 12.5 59.1 ...
 $ avg_prem_balance : num  2494 2494 2494 2494 2494 ...
 $ retail_crdt_ratio: num  0 11.5 0 0.8 20.8 ...
 $ avg_fin_balance  : num  1767 1767 0 1021 797 ...
 $ mortgage_age     : num  182 139 139 139 93 ...
 $ cred_limit       : num  12500 0 0 0 0 0 0 0 11500 16000 ...
 

# Examine the csale data frame
head(csale)
str(csale)

# Fit a logistic model
log_mod <- gam(purchase ~ s(mortgage_age), data = csale, 
               family = binomial, 
               method = "REML")

summary(log_mod)


Family: binomial 
Link function: logit 

Formula:
purchase ~ s(mortgage_age)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.34131    0.05908   -22.7   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                  edf Ref.df Chi.sq  p-value    
s(mortgage_age) 5.226  6.317  30.02 4.11e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0182   Deviance explained =  1.9%
-REML = 910.49  Scale est. = 1         n = 1779

> plogis(-1.34131)
[1] 0.2072947


What purchase probability does log_mod estimate for a person who has mean values of all variables?
--> About 21%. This is the probability when all smooths are at zero and the only effect is the
intercept, converted from the log-odds scale with plogis().
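
The conversion from the intercept to a probability can be checked by hand. The sketch below uses only base R and the intercept printed in the summary of log_mod; the variable names are illustrative and not part of the exercise.

```r
# Intercept of log_mod on the log-odds (logit) scale, copied from the summary above
intercept <- -1.34131

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_manual <- 1 / (1 + exp(-intercept))
p_plogis <- plogis(intercept)

# Both give roughly 0.207, i.e. about a 21% purchase probability
round(c(manual = p_manual, plogis = p_plogis), 3)
```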

-----------------------------------------------------------------------------------------------------------

Purchase behavior with multiple smooths

In this exercise, you will fit a logistic GAM that predicts the outcome (purchase) in csale, using all
other variables as predictors.

* Fit a logistic GAM on all variables.
* Print the summary of the model.


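# Fit a logistic model with a smooth of every predictor
log_mod2 <- gam(purchase ~ s(n_acts) + s(bal_crdt_ratio) +
                  s(avg_prem_balance) + s(retail_crdt_ratio) +
                  s(avg_fin_balance) + s(mortgage_age) + s(cred_limit),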
                data = csale,
                family = binomial,
                method = "REML")

# View the summary
summary(log_mod2)


Family: binomial 
Link function: logit 

Formula:
purchase ~ s(n_acts) + s(bal_crdt_ratio) + s(avg_prem_balance) + 
    s(retail_crdt_ratio) + s(avg_fin_balance) + s(mortgage_age) + 
    s(cred_limit)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.64060    0.07557  -21.71   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                       edf Ref.df Chi.sq  p-value    
s(n_acts)            3.474  4.310 93.670  < 2e-16 ***
s(bal_crdt_ratio)    4.308  5.257 18.386  0.00319 ** 
s(avg_prem_balance)  2.275  2.816  7.800  0.04889 *  
s(retail_crdt_ratio) 1.001  1.001  1.422  0.23343    
s(avg_fin_balance)   1.850  2.202  2.506  0.27889    
s(mortgage_age)      4.669  5.710  9.656  0.13356    
s(cred_limit)        1.001  1.002 23.066 2.37e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.184   Deviance explained = 18.4%
-REML = 781.37  Scale est. = 1         n = 1779


Which term in the model is significant but approximately linear?
--> s(cred_limit)
This term is approximately linear, with edf near one, and significant, with p < 0.05. The same
rules for reading edf and p-values apply in the case of logistic GAMs.
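
The same reading of the table can be sketched in code. The values below are copied from the summary output above; the thresholds (edf below 1.1 as "near one", p below 0.05 as significant) are illustrative assumptions rather than fixed rules.

```r
# edf and p-values copied from the "Approximate significance of smooth terms" table
smooth_terms <- data.frame(
  term = c("s(n_acts)", "s(bal_crdt_ratio)", "s(avg_prem_balance)",
           "s(retail_crdt_ratio)", "s(avg_fin_balance)",
           "s(mortgage_age)", "s(cred_limit)"),
  edf  = c(3.474, 4.308, 2.275, 1.001, 1.850, 4.669, 1.001),
  # s(n_acts) is reported as "< 2e-16"; 2e-16 is used here as an upper bound
  p    = c(2e-16, 0.00319, 0.04889, 0.23343, 0.27889, 0.13356, 2.37e-06),
  stringsAsFactors = FALSE
)

# Assumed thresholds: approximately linear (edf < 1.1) and significant (p < 0.05)
linear_and_significant <- subset(smooth_terms, edf < 1.1 & p < 0.05)
linear_and_significant$term
```

Note that s(retail_crdt_ratio) also has edf near one but is filtered out because its p-value is above 0.05. With the fitted model in the workspace, the same table should be available as summary(log_mod2)$s.table.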

-----------------------------------------------------------------------------------------------------------

Visualizing influences on purchase probability

Let's try plotting and interpreting the purchasing behavior model from the last lesson. You'll step 
through several iterations of plots of log_mod2, moving from raw plots on the link (log-odds) scale 
towards plots on the response (probability) scale with more natural interpretations.

The model (log_mod2) from the previous exercise is available in your workspace.

* Plot all partial effects of log_mod2 on the log-odds scale. Put all plots on a single page.
* Convert the plots to the probability scale using the trans argument.
* Convert the plots to probability centered on the intercept with the shift argument.
* Add intercept-related uncertainty to the plots using the seWithMean argument.

Each transformation brings us to a more interpretable output.


# Plot on the log-odds scale
plot(log_mod2, pages = 1)

# Plot on the probability scale
plot(log_mod2, pages = 1, trans = plogis)

# Plot with the intercept
plot(log_mod2, pages = 1, trans = plogis, 
     shift = coef(log_mod2)[1])

# Plot with intercept uncertainty
plot(log_mod2, pages = 1, trans = plogis, 
     shift = coef(log_mod2)[1], seWithMean = TRUE)
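
To see what trans and shift do to individual points on a partial-effect curve, here is a minimal base R sketch. The partial-effect values are hypothetical and chosen only for illustration; the intercept is the one printed in the summary of log_mod2.

```r
# Intercept of log_mod2 on the log-odds scale, copied from the summary above
intercept <- -1.64060

# Hypothetical partial-effect values on the log-odds scale (illustration only)
partial_effect <- c(-1, 0, 1)

# trans = plogis alone: the centered effect is converted directly, so 0 maps to 0.5
prob_unshifted <- plogis(partial_effect)

# shift = intercept, then trans = plogis: 0 maps to the baseline purchase probability
prob_shifted <- plogis(partial_effect + intercept)

round(data.frame(partial_effect, prob_unshifted, prob_shifted), 3)
```

Without the shift, a partial effect of zero is drawn at 0.5, which is not a meaningful purchase probability. With the shift, it is drawn at about 0.16, the model's baseline probability.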


For the next few questions, you'll inspect the partial effects plots of some of the terms in log_mod2.

Q1: All else being equal, which of these variables has the largest effect on purchase probability?
A1: n_acts has the largest effect on purchase probability.

Q2: What is the expected purchase probability of a person with 20 accounts (n_acts = 20) if all other
    values are average?
A2: When n_acts is 20 the predicted probability of purchase is about 0.55, all else being equal.

Q3: Which of these predictions has the greatest uncertainty, assuming all other variables are at 
    average levels?
A3: The probability of purchase when avg_fin_balance is 2000.
    The probability of purchase when mortgage_age is 50.
    The probability of purchase when avg_fin_balance is 5000. <--
    The probability of purchase when mortgage_age is 150.
There are wide confidence intervals around the avg_fin_balance smooth at a value of 5000.

-----------------------------------------------------------------------------------------------------------

Predicting purchase behavior and uncertainty

The log_mod2 purchase behavior model lets you make predictions from credit data. In this exercise, you'll
use a new set of data, new_credit_data, and calculate predicted outcomes and confidence bounds.

The model (log_mod2) from exercise 3 is available in your workspace.

* Use the model log_mod2 to calculate the predicted purchase log-odds, and standard errors for those
   predictions, for the observations in new_credit_data.
* Using your predictions and standard errors, calculate high and low confidence bounds for the log-odds
  of purchase for each observation.
* Calculate the high confidence bound as two standard errors above the mean prediction, and the low
  confidence bound as two standard errors below the mean prediction. 
* Convert your calculated confidence bounds from the log-odds scale to the probability scale.

> new_credit_data
  purchase n_acts bal_crdt_ratio avg_prem_balance retail_crdt_ratio
1        1      2        0.30000           61.000          11.49123
2        0     19        4.20000          967.000           0.00000
3        0      0       36.09506         2494.414          11.49123
4        0      0       36.09506         2494.414          11.49123
5        0      1       25.70000         2494.414          11.49123
6        0      6       45.60000          195.000           0.00000
7        0      3       10.80000         2494.414          11.49123
  avg_fin_balance mortgage_age cred_limit
1        1767.197     155.0000          0
2         249.000      65.0000      10000
3        1767.197     138.9601          0
4        1767.197     138.9601          0
5        1767.197     138.9601          0
6           0.000      13.0000      13800
7        1767.197     138.9601          0

# Calculate predictions and errors
predictions <- predict(log_mod2, newdata = new_credit_data, 
                       type = "link", se.fit = TRUE)

# Inspect the predictions
predictions


$fit
          1           2           3           4           5           6 
-1.20182840  0.07182382 -3.03331719 -3.03331719 -2.43727488 -1.89483336 
          7 
-1.35595902 

$se.fit
        1         2         3         4         5         6         7 
0.2614862 0.3608447 0.2238734 0.2238734 0.1831934 0.3806117 0.2019073 


# Calculate high and low confidence bounds on the log-odds scale
high_pred <- predictions$fit + 2 * predictions$se.fit
low_pred <- predictions$fit - 2 * predictions$se.fit

# Convert intervals to probability scale
high_prob <- plogis(high_pred)
low_prob <- plogis(low_pred)

# Inspect
high_prob
low_prob


high_prob
         1          2          3          4          5          6          7 
0.33651666 0.68858519 0.07007288 0.07007288 0.11195872 0.24349553 0.27845378 
low_prob
         1          2          3          4          5          6          7 
0.15125383 0.34301982 0.02985584 0.02985584 0.05712662 0.06561668 0.14681869 
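
The bounds above can be reproduced without the model object. The sketch below copies the $fit and $se.fit values from the printed output and adds the point estimate on the probability scale; the data frame and its column names are illustrative.

```r
# Link-scale predictions and standard errors copied from the output above
fit <- c(-1.20182840, 0.07182382, -3.03331719, -3.03331719,
         -2.43727488, -1.89483336, -1.35595902)
se_fit <- c(0.2614862, 0.3608447, 0.2238734, 0.2238734,
            0.1831934, 0.3806117, 0.2019073)

# Build the bounds on the log-odds scale first, then transform with plogis()
bounds <- data.frame(
  low_prob  = plogis(fit - 2 * se_fit),
  estimate  = plogis(fit),
  high_prob = plogis(fit + 2 * se_fit)
)

round(bounds, 3)
```

Computing the bounds on the log-odds scale and transforming afterwards keeps every bound between 0 and 1, which adding and subtracting standard errors directly on the probability scale would not guarantee.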

-----------------------------------------------------------------------------------------------------------

Explaining individual behaviors

In the final exercise of this chapter, you will use the model log_mod2 to predict the contribution of 
each term to the prediction of one row in new_credit_data.

* Predict the effect of each model term on the output for the first row of new_credit_data.


# Predict from the model
prediction_1 <- predict(log_mod2, 
                        newdata = new_credit_data, 
                        type = "terms")[1,]

# Inspect
prediction_1

           s(n_acts)    s(bal_crdt_ratio)  s(avg_prem_balance) 
        -0.420012186          0.455136892          0.238379001 
s(retail_crdt_ratio)   s(avg_fin_balance)      s(mortgage_age) 
         0.001031828          0.016478436          0.025013275 
       s(cred_limit) 
         0.380455757 

Which term makes the greatest contribution to the prediction of this first data point?
--> For this data point, s(bal_crdt_ratio) makes the greatest contribution to the prediction (about 0.46
on the log-odds scale). Note that this is despite the fact that s(n_acts) has a larger overall effect,
as we saw in the last section.
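
Ranking the terms by absolute contribution makes the comparison explicit. This sketch copies the per-term values from the output above into a named vector; the name ranked is illustrative.

```r
# Per-term contributions for row 1, copied from the output above (log-odds scale)
prediction_1 <- c(
  "s(n_acts)"            = -0.420012186,
  "s(bal_crdt_ratio)"    =  0.455136892,
  "s(avg_prem_balance)"  =  0.238379001,
  "s(retail_crdt_ratio)" =  0.001031828,
  "s(avg_fin_balance)"   =  0.016478436,
  "s(mortgage_age)"      =  0.025013275,
  "s(cred_limit)"        =  0.380455757
)

# Rank terms by the absolute size of their contribution, largest first
ranked <- sort(abs(prediction_1), decreasing = TRUE)

# Name of the term with the largest absolute contribution
names(ranked)[1]
```

The sign still matters when interpreting the result: s(n_acts) is a close second in size, but its contribution is negative, so it lowers the log-odds of purchase for this person.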