Significance and linearity (I)

It's time for you to summarize a model and interpret the output.

* Summarize the mod_city4 model.

> str(mpg)
'data.frame':	205 obs. of  26 variables:
 $ symbol    : int  3 3 1 2 2 2 1 1 1 0 ...
 $ loss      : int  NA NA NA 164 164 NA 158 NA 158 NA ...
 $ make      : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
 $ fuel      : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
 $ aspir     : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
 $ doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
 $ style     : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
 $ drive     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
 $ eng.loc   : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
 $ wb        : num  88.6 88.6 94.5 99.8 99.4 ...
 $ length    : num  169 169 171 177 177 ...
 $ width     : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
 $ height    : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
 $ weight    : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
 $ eng.type  : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
 $ cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
 $ eng.cc    : int  130 130 152 109 136 136 136 136 131 131 ...
 $ fuel.sys  : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ bore      : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
 $ stroke    : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
 $ comp.ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
 $ hp        : int  111 111 154 102 115 110 110 110 140 160 ...
 $ rpm       : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
 $ city.mpg  : int  21 21 19 24 18 19 19 19 17 16 ...
 $ hw.mpg    : int  27 27 26 30 22 25 25 25 20 22 ...
 $ price     : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
library(mgcv)

# Fit the model
mod_city4 <- gam(city.mpg ~ s(weight) + s(length) + s(price) + s(rpm) + s(width),
                 data = mpg, method = "REML")

# View the summary
summary(mod_city4)

Family: gaussian
Link function: identity

Formula:
city.mpg ~ s(weight) + s(length) + s(price) + s(rpm) + s(width)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   25.201      0.188     134   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df      F  p-value
s(weight) 5.620  6.799 17.524  < 2e-16 ***
s(length) 2.943  3.759  0.904    0.421
s(price)  1.000  1.000 16.647 6.79e-05 ***
s(rpm)    7.751  8.499 16.486  < 2e-16 ***
s(width)  1.003  1.005  0.006    0.954
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.831   Deviance explained = 84.7%
-REML = 496.47  Scale est. = 7.0365    n = 199

Which smooth term in this model is significant and linear?
--> price: it is significant (p < 0.05) and effectively linear (edf of 1).

Which smooth term is non-significant and non-linear?
--> length: it is non-significant (p = 0.421) and non-linear (edf of about 2.9).

-----------------------------------------------------------------------------------------------------------

Plotting the motorcycle crash model and data

For our first plotting exercise, we'll add partial residuals to the partial effect plot of a GAM so as to compare the model to the data.

* Plot the model (mod) that uses the mcycle data. Add partial residuals to the plot.
* Make a second plot, making the partial residuals more visible by changing their shape with the pch argument and their size with the cex argument. Set both the pch and cex arguments to 1.

> str(mcycle)
'data.frame':	133 obs. of  2 variables:
 $ times: num  2.4 2.6 3.2 3.6 4 6.2 6.6 6.8 7.8 8.2 ...
 $ accel: num  0 -1.3 -2.7 0 -2.7 -2.7 -2.7 -1.3 -2.7 -2.7 ...
library(mgcv)

# Fit the model
mod <- gam(accel ~ s(times), data = mcycle, method = "REML")

# Make the plot with partial residuals
plot(mod, residuals = TRUE)

# Change the shape and size of the residuals
plot(mod, residuals = TRUE, pch = 1, cex = 1)

Plotting residuals helps you understand the quality of your model fit.

-----------------------------------------------------------------------------------------------------------

Plotting multiple auto performance variables

When plotting GAMs, you sometimes want to look at just parts of a model, or all the terms in the model. Here you'll practice selecting which terms to visualize.

library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(weight) + s(rpm) + s(price) + comp.ratio,
           data = mpg, method = "REML")

# Plot the price effect (the third smooth term)
plot(mod, select = 3)

# Plot all effects
plot(mod, pages = 1, all.terms = TRUE)

A plot like this gives a good overview of everything going on in the model.

-----------------------------------------------------------------------------------------------------------

Visualizing auto performance uncertainty

Confidence intervals are a very important visual indicator of model fit. Here you'll practice changing the appearance of confidence intervals and transforming the scale of partial effect plots.

* Plot the model (mod) that uses the mpg data, showing only the partial effect of weight. Make the confidence interval shaded and "hotpink" in color.
* Make another plot of the weight partial effect, this time shifting the scale by the value of the intercept using the shift argument, and including the uncertainty of the model intercept using the seWithMean argument.
library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(weight) + s(rpm) + s(price) + comp.ratio,
           data = mpg, method = "REML")

# Plot the weight effect with colored shading
plot(mod, select = 1, shade = TRUE, shade.col = "hotpink")

# Make another plot, adding the intercept value and its uncertainty
plot(mod, select = 1, seWithMean = TRUE, shift = coef(mod)[1])

This plot is very helpful for explaining or interpreting model results. Incorporating the intercept into the partial effect plot makes the y-axis naturally interpretable.

-----------------------------------------------------------------------------------------------------------

Reading model diagnostics

gam.check() helps you understand whether you have enough basis functions to model the data.

> str(dat)
'data.frame':	200 obs. of  10 variables:
 $ y : num  11.17 2.81 12.9 5.68 5.58 ...
 $ x0: num  0.897 0.266 0.372 0.573 0.908 ...
 $ x1: num  0.783 0.268 0.219 0.517 0.269 ...
 $ x2: num  0.148 0.659 0.185 0.954 0.898 ...
 $ x3: num  0.45 0.814 0.929 0.147 0.75 ...
 $ f : num  11.95 6.45 11.58 4.76 2.35 ...
 $ f0: num  0.638 1.481 1.841 1.948 0.569 ...
 $ f1: num  4.79 1.71 1.55 2.81 1.71 ...
 $ f2: num  6.52548 3.26245 8.18914 0.00108 0.06946 ...
 $ f3: num  0 0 0 0 0 0 0 0 0 0 ...

library(mgcv)

# Fit the model
mod <- gam(y ~ s(x0, k = 5) + s(x1, k = 5) + s(x2, k = 5) + s(x3, k = 5),
           data = dat, method = "REML")

# Run the check function
gam.check(mod)

Method: REML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-0.0001190691,0.0001259251]
(score 460.9549 & scale 5.229925).
Hessian positive definite, eigenvalue range [0.0001190726,97.53256].
Model rank =  17 / 17

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'  edf k-index p-value
s(x0) 4.00 2.54    0.99   0.410
s(x1) 4.00 2.25    0.95   0.300
s(x2) 4.00 3.94    0.84   0.015 *
s(x3) 4.00 1.00    0.99   0.435
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Which smooths do not have sufficient numbers of basis functions?
--> s(x2) does not have enough basis functions: its k-index is below 1 with a significant p-value (0.015) in the diagnostic test, and its edf (3.94) is close to k' (4).

-----------------------------------------------------------------------------------------------------------

Fixing problems with model diagnostics

You can use gam.check() to improve models by updating them based on its results.

* Run the model diagnostics on mod.
* Based on the diagnostics, re-fit the model as mod2, changing the number of basis functions (k) for the failing smooths.
* Run the model diagnostics on mod2 to ensure you have fixed the issue.

library(mgcv)

# Fit the model
mod <- gam(y ~ s(x0, k = 3) + s(x1, k = 3) + s(x2, k = 3) + s(x3, k = 3),
           data = dat, method = "REML")

# Check the diagnostics
gam.check(mod)

Method: REML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-2.133524e-06,2.107185e-06]
(score 1262.659 & scale 3.777109).
Hessian positive definite, eigenvalue range [0.2592108,297.5029].
Model rank =  9 / 9

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'  edf k-index p-value
s(x0) 2.00 1.99    0.98    0.32
s(x1) 2.00 1.95    1.07    0.94
s(x2) 2.00 2.00    0.11  <2e-16 ***
s(x3) 2.00 1.72    1.03    0.78
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Refit, increasing k for the failing smooth
mod2 <- gam(y ~ s(x0, k = 3) + s(x1, k = 3) + s(x2, k = 12) + s(x3, k = 3),
            data = dat, method = "REML")

# Check the new model
gam.check(mod2)

Method: REML   Optimizer: outer newton
full convergence after 9 iterations.
Gradient range [-3.827017e-08,3.651908e-08]
(score 615.861 & scale 0.3924265).
Hessian positive definite, eigenvalue range [0.3995874,297.5842].
Model rank =  18 / 18

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'   edf k-index p-value
s(x0)  2.00  2.00    1.07    0.96
s(x1)  2.00  2.00    0.97    0.20
s(x2) 11.00 10.79    1.01    0.56
s(x3)  2.00  1.89    0.98    0.29

As you can see, model fitting is often an iterative process of fitting and checking.

-----------------------------------------------------------------------------------------------------------

Examining overall concurvity in auto data

Let's take a look at concurvity among the fuel efficiency model variables.

library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(length) + s(width) + s(height) + s(weight),
           data = mpg, method = "REML")

# Check overall concurvity
concurvity(mod, full = TRUE)

                 para s(length)  s(width) s(height) s(weight)
worst    1.079374e-20 0.9303404 0.9322887 0.6723705 0.9603887
observed 1.079374e-20 0.7534619 0.8757513 0.4869308 0.8793300
estimate 1.079374e-20 0.8353324 0.7943374 0.4452676 0.8567519

Which smooth is least pre-determined by all the other variables?
--> s(height). Height has relatively low concurvity (worst case about 0.67); it isn't too similar to any of the other variables.

-----------------------------------------------------------------------------------------------------------

Examining concurvity between auto variables

Now, let's look at concurvity between model variables. After examining the pairwise concurvity between variables in mod, answer the following question: Which two variables have the greatest worst-case concurvity?
library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(length) + s(width) + s(height) + s(weight),
           data = mpg, method = "REML")

# Check pairwise concurvity
concurvity(mod, full = FALSE)

$worst
                  para    s(length)     s(width)    s(height)    s(weight)
para      1.000000e+00 4.799804e-26 5.458174e-21 4.926340e-23 3.221614e-25
s(length) 4.759962e-26 1.000000e+00 8.336513e-01 6.058015e-01 8.797217e-01
s(width)  5.458344e-21 8.336513e-01 1.000000e+00 4.099837e-01 8.953662e-01
s(height) 4.927251e-23 6.058015e-01 4.099837e-01 1.000000e+00 3.665831e-01
s(weight) 3.233688e-25 8.797217e-01 8.953662e-01 3.665831e-01 1.000000e+00

$observed
                  para    s(length)     s(width)    s(height)    s(weight)
para      1.000000e+00 1.128295e-29 4.467995e-32 9.887661e-34 6.730965e-31
s(length) 4.759962e-26 1.000000e+00 7.511142e-01 2.827977e-01 8.232449e-01
s(width)  5.458344e-21 5.077384e-01 1.000000e+00 1.186126e-01 7.813743e-01
s(height) 4.927251e-23 2.284116e-01 3.313152e-01 1.000000e+00 2.900361e-01
s(weight) 3.233688e-25 6.052819e-01 7.863555e-01 1.494913e-01 1.000000e+00

$estimate
                  para    s(length)     s(width)    s(height)    s(weight)
para      1.000000e+00 1.564968e-28 1.740649e-23 3.448567e-25 1.481483e-27
s(length) 4.759962e-26 1.000000e+00 6.415191e-01 2.271285e-01 7.209033e-01
s(width)  5.458344e-21 6.477497e-01 1.000000e+00 1.054762e-01 7.241891e-01
s(height) 4.927251e-23 3.303484e-01 2.644827e-01 1.000000e+00 2.669300e-01
s(weight) 3.233688e-25 7.235198e-01 6.913221e-01 1.390568e-01 1.000000e+00

Which two variables have the greatest worst-case concurvity?
--> weight and width, with a worst-case concurvity of about 0.895.

-----------------------------------------------------------------------------------------------------------
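Rather than scanning the $worst matrix by eye, you can pull out the largest pairwise value programmatically. A minimal sketch, assuming mod has been fit as in the exercise above:

# Extract the worst-case pairwise concurvity matrix
worst <- concurvity(mod, full = FALSE)$worst

# Zero out the diagonal: each term trivially has concurvity 1 with itself
diag(worst) <- 0

# Locate the maximum off-diagonal entry and name the pair of terms
idx <- which(worst == max(worst), arr.ind = TRUE)[1, ]
c(rownames(worst)[idx[1]], colnames(worst)[idx[2]])  # the pair: s(weight) and s(width)
max(worst)                                           # about 0.895

This generalizes to models with many smooths, where reading a large concurvity matrix by eye becomes error-prone.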