Significance and linearity (I)

It's time for you to summarize a model and interpret the output.

* Summarize the mod_city4 model.

> str(mpg)
'data.frame':	205 obs. of  26 variables:
 $ symbol    : int  3 3 1 2 2 2 1 1 1 0 ...
 $ loss      : int  NA NA NA 164 164 NA 158 NA 158 NA ...
 $ make      : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
 $ fuel      : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
 $ aspir     : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
 $ doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
 $ style     : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
 $ drive     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
 $ eng.loc   : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
 $ wb        : num  88.6 88.6 94.5 99.8 99.4 ...
 $ length    : num  169 169 171 177 177 ...
 $ width     : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
 $ height    : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
 $ weight    : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
 $ eng.type  : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
 $ cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
 $ eng.cc    : int  130 130 152 109 136 136 136 136 131 131 ...
 $ fuel.sys  : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ bore      : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
 $ stroke    : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
 $ comp.ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
 $ hp        : int  111 111 154 102 115 110 110 110 140 160 ...
 $ rpm       : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
 $ city.mpg  : int  21 21 19 24 18 19 19 19 17 16 ...
 $ hw.mpg    : int  27 27 26 30 22 25 25 25 20 22 ...
 $ price     : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
library(mgcv)

# Fit the model
mod_city4 <- gam(city.mpg ~ s(weight) + s(length) + s(price) + s(rpm) + s(width),
                 data = mpg, method = "REML")

# View the summary
summary(mod_city4)

Family: gaussian
Link function: identity

Formula:
city.mpg ~ s(weight) + s(length) + s(price) + s(rpm) + s(width)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   25.201      0.188     134   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df      F  p-value
s(weight) 5.620  6.799 17.524  < 2e-16 ***
s(length) 2.943  3.759  0.904    0.421
s(price)  1.000  1.000 16.647 6.79e-05 ***
s(rpm)    7.751  8.499 16.486  < 2e-16 ***
s(width)  1.003  1.005  0.006    0.954
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.831   Deviance explained = 84.7%
-REML = 496.47  Scale est. = 7.0365    n = 199

Which smooth term in this model is significant and linear?
--> price: it is significant (p < 0.05) and effectively linear (edf of 1).

Which smooth term is non-significant and non-linear?
--> length: it is non-significant (p = 0.421) and non-linear (edf of about 2.9).

-----------------------------------------------------------------------------------------------------------

Plotting the motorcycle crash model and data

For our first plotting exercise, we'll add partial residuals to the partial effect plot of a GAM so as to compare the model to the data.

* Plot the model (mod) that uses the mcycle data. Add partial residuals to the plot.
* Make a second plot, making the partial residuals more visible by changing their shape with the pch argument and their size with the cex argument. Set both the pch and cex arguments to 1.

> str(mcycle)
'data.frame':	133 obs. of  2 variables:
 $ times: num  2.4 2.6 3.2 3.6 4 6.2 6.6 6.8 7.8 8.2 ...
 $ accel: num  0 -1.3 -2.7 0 -2.7 -2.7 -2.7 -1.3 -2.7 -2.7 ...
library(mgcv)

# Fit the model
mod <- gam(accel ~ s(times), data = mcycle, method = "REML")

# Make the plot with partial residuals
plot(mod, residuals = TRUE)

# Change the shape and size of the residuals
plot(mod, residuals = TRUE, pch = 1, cex = 1)

Plotting residuals helps you understand the quality of your model fit.

-----------------------------------------------------------------------------------------------------------

Plotting multiple auto performance variables

When plotting GAMs, you sometimes want to look at just parts of a model, or all the terms in the model. Here you'll practice selecting which terms to visualize.

library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(weight) + s(rpm) + s(price) + comp.ratio,
           data = mpg, method = "REML")

# Plot the price effect (the third smooth term)
plot(mod, select = 3)

# Plot all effects
plot(mod, pages = 1, all.terms = TRUE)

A plot like this gives a good overview of everything going on in the model.

-----------------------------------------------------------------------------------------------------------

Visualizing auto performance uncertainty

Confidence intervals are a very important visual indicator of model fit. Here you'll practice changing the appearance of confidence intervals and transforming the scale of partial effect plots.

* Plot the model (mod) that uses the mpg data, showing only the partial effect of weight. Make the confidence interval shaded and "hotpink" in color.
* Make another plot of the weight partial effect, this time shifting the scale by the value of the intercept using the shift argument, and including the uncertainty of the model intercept using the seWithMean argument.
library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(weight) + s(rpm) + s(price) + comp.ratio,
           data = mpg, method = "REML")

# Plot the weight effect with colored shading
plot(mod, select = 1, shade = TRUE, shade.col = "hotpink")

# Make another plot, adding the intercept value and its uncertainty
plot(mod, select = 1, seWithMean = TRUE, shift = coef(mod)[1])

This plot is very helpful for explaining or interpreting model results. Incorporating the intercept into the partial effect plot makes the y-axis naturally interpretable.

-----------------------------------------------------------------------------------------------------------

Reading model diagnostics

gam.check() helps you understand whether you have enough basis functions to model the data.

> str(dat)
'data.frame':	200 obs. of  10 variables:
 $ y : num  11.17 2.81 12.9 5.68 5.58 ...
 $ x0: num  0.897 0.266 0.372 0.573 0.908 ...
 $ x1: num  0.783 0.268 0.219 0.517 0.269 ...
 $ x2: num  0.148 0.659 0.185 0.954 0.898 ...
 $ x3: num  0.45 0.814 0.929 0.147 0.75 ...
 $ f : num  11.95 6.45 11.58 4.76 2.35 ...
 $ f0: num  0.638 1.481 1.841 1.948 0.569 ...
 $ f1: num  4.79 1.71 1.55 2.81 1.71 ...
 $ f2: num  6.52548 3.26245 8.18914 0.00108 0.06946 ...
 $ f3: num  0 0 0 0 0 0 0 0 0 0 ...

library(mgcv)

# Fit the model
mod <- gam(y ~ s(x0, k = 5) + s(x1, k = 5) + s(x2, k = 5) + s(x3, k = 5),
           data = dat, method = "REML")

# Run the check function
gam.check(mod)

Method: REML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-0.0001190691,0.0001259251]
(score 460.9549 & scale 5.229925).
Hessian positive definite, eigenvalue range [0.0001190726,97.53256].
Model rank =  17 / 17

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'  edf k-index p-value
s(x0) 4.00 2.54    0.99   0.410
s(x1) 4.00 2.25    0.95   0.300
s(x2) 4.00 3.94    0.84   0.015 *
s(x3) 4.00 1.00    0.99   0.435
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Which smooths do not have sufficient numbers of basis functions?
--> s(x2) does not have enough basis functions: its k-index is below 1 with a significant p-value (0.015) in the diagnostic test, and its edf (3.94) is close to k' (4).

-----------------------------------------------------------------------------------------------------------

Fixing problems with model diagnostics

You can use gam.check() to improve models by updating them based on its results.

* Run the model diagnostics on mod.
* Based on the diagnostics, re-fit the model as mod2, changing the number of basis functions (k) for the failing smooths.
* Run the model diagnostics on mod2 to ensure you have fixed the issue.

library(mgcv)

# Fit the model
mod <- gam(y ~ s(x0, k = 3) + s(x1, k = 3) + s(x2, k = 3) + s(x3, k = 3),
           data = dat, method = "REML")

# Check the diagnostics
gam.check(mod)

Method: REML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-2.133524e-06,2.107185e-06]
(score 1262.659 & scale 3.777109).
Hessian positive definite, eigenvalue range [0.2592108,297.5029].
Model rank =  9 / 9

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'  edf k-index p-value
s(x0) 2.00 1.99    0.98    0.32
s(x1) 2.00 1.95    1.07    0.94
s(x2) 2.00 2.00    0.11  <2e-16 ***
s(x3) 2.00 1.72    1.03    0.78
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Refit, increasing k for the failing smooth
mod2 <- gam(y ~ s(x0, k = 3) + s(x1, k = 3) + s(x2, k = 12) + s(x3, k = 3),
            data = dat, method = "REML")

# Check the new model
gam.check(mod2)

Method: REML   Optimizer: outer newton
full convergence after 9 iterations.
Gradient range [-3.827017e-08,3.651908e-08]
(score 615.861 & scale 0.3924265).
Hessian positive definite, eigenvalue range [0.3995874,297.5842].
Model rank =  18 / 18

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'   edf k-index p-value
s(x0)  2.00  2.00    1.07    0.96
s(x1)  2.00  2.00    0.97    0.20
s(x2) 11.00 10.79    1.01    0.56
s(x3)  2.00  1.89    0.98    0.29

As you can see, model fitting is often an iterative process of fitting and checking.

-----------------------------------------------------------------------------------------------------------

Examining overall concurvity in auto data

Let's take a look at concurvity among the fuel efficiency model variables.

library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(length) + s(width) + s(height) + s(weight),
           data = mpg, method = "REML")

# Check overall concurvity
concurvity(mod, full = TRUE)

                 para s(length)  s(width) s(height) s(weight)
worst    1.079374e-20 0.9303404 0.9322887 0.6723705 0.9603887
observed 1.079374e-20 0.7534619 0.8757513 0.4869308 0.8793300
estimate 1.079374e-20 0.8353324 0.7943374 0.4452676 0.8567519

Which smooth is least pre-determined by all the other variables?
--> s(height). Height has relatively low concurvity (worst case about 0.67); it isn't too similar to any of the other variables.

-----------------------------------------------------------------------------------------------------------

Examining concurvity between auto variables

Now, let's look at concurvity between model variables. After examining the pairwise concurvity between variables in mod, answer the following question: Which two variables have the greatest worst-case concurvity?
library(mgcv)

# Fit the model
mod <- gam(hw.mpg ~ s(length) + s(width) + s(height) + s(weight),
           data = mpg, method = "REML")

# Check pairwise concurvity
concurvity(mod, full = FALSE)

$worst
                  para    s(length)     s(width)    s(height)    s(weight)
para      1.000000e+00 4.799804e-26 5.458174e-21 4.926340e-23 3.221614e-25
s(length) 4.759962e-26 1.000000e+00 8.336513e-01 6.058015e-01 8.797217e-01
s(width)  5.458344e-21 8.336513e-01 1.000000e+00 4.099837e-01 8.953662e-01
s(height) 4.927251e-23 6.058015e-01 4.099837e-01 1.000000e+00 3.665831e-01
s(weight) 3.233688e-25 8.797217e-01 8.953662e-01 3.665831e-01 1.000000e+00

$observed
                  para    s(length)     s(width)    s(height)    s(weight)
para      1.000000e+00 1.128295e-29 4.467995e-32 9.887661e-34 6.730965e-31
s(length) 4.759962e-26 1.000000e+00 7.511142e-01 2.827977e-01 8.232449e-01
s(width)  5.458344e-21 5.077384e-01 1.000000e+00 1.186126e-01 7.813743e-01
s(height) 4.927251e-23 2.284116e-01 3.313152e-01 1.000000e+00 2.900361e-01
s(weight) 3.233688e-25 6.052819e-01 7.863555e-01 1.494913e-01 1.000000e+00

$estimate
                  para    s(length)     s(width)    s(height)    s(weight)
para      1.000000e+00 1.564968e-28 1.740649e-23 3.448567e-25 1.481483e-27
s(length) 4.759962e-26 1.000000e+00 6.415191e-01 2.271285e-01 7.209033e-01
s(width)  5.458344e-21 6.477497e-01 1.000000e+00 1.054762e-01 7.241891e-01
s(height) 4.927251e-23 3.303484e-01 2.644827e-01 1.000000e+00 2.669300e-01
s(weight) 3.233688e-25 7.235198e-01 6.913221e-01 1.390568e-01 1.000000e+00

Which two variables have the greatest worst-case concurvity?
--> weight and width, with a worst-case concurvity of about 0.895.

-----------------------------------------------------------------------------------------------------------
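Rather than scanning the $worst matrix by eye, you can pull out the largest pairwise value programmatically. A minimal sketch, assuming mod has been fit as in the exercise above:

# Extract the worst-case pairwise concurvity matrix
worst <- concurvity(mod, full = FALSE)$worst

# Zero out the diagonal: each term trivially has concurvity 1 with itself
diag(worst) <- 0

# Locate the maximum off-diagonal entry and name the pair of terms
idx <- which(worst == max(worst), arr.ind = TRUE)[1, ]
c(rownames(worst)[idx[1]], colnames(worst)[idx[2]])  # the pair: s(weight) and s(width)
max(worst)                                           # about 0.895

This generalizes to models with many smooths, where reading a large concurvity matrix by eye becomes error-prone.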