Motorcycle crash data: linear approach In this first exercise, you will fit a linear model to a data set and visualize the results to see how well it captures relationships in the data. The data set, stored in a data frame named mcycle, contains measurement of acceleration of a crash-test dummy head during a motorcycle crash. It contains measurements of acceleration (g) in the accel column and time (milliseconds) in the times column. * Use the head() and plot() functions to look at the data frame named mcycle. * Use the lm() function to fit a model to the mcycle data where the accel variable is a linear function of times. * Visualize the model fit using the provided call to termplot(). > glimpse(mcycle) Rows: 133 Columns: 2 $ times 2.4, 2.6, 3.2, 3.6, 4.0, 6.2, 6.6, 6.8, 7.8, 8.2, 8.8, 8.8, 9.6,… $ accel 0.0, -1.3, -2.7, 0.0, -2.7, -2.7, -2.7, -1.3, -2.7, -2.7, -1.3, … # Examine the mcycle data frame head(mcycle) plot(mcycle) # Fit a linear model lm_mod <- lm(accel ~ times, data = mcycle) # Visualize the model termplot(lm_mod, partial.resid = TRUE, se = TRUE) As you can see, the linear model does a poor job fitting this data, which clearly has a nonlinear pattern. ----------------------------------------------------------------------------------------------------------- Motorcycle crash data: non-linear approach Now you will fit a non-linear model to the same mcycle data using the gam() function from the mgcv package. * Load the mgcv package. * Fit a model to the mcycle data where accel has a smooth, nonlinear relation to times using the gam() function. * Visualize the model fit using the provided call to plot(). # Load mgcv library(mgcv) # Fit the model gam_mod <- gam(accel ~ s(times), data = mcycle) # Plot the results plot(gam_mod, residuals = TRUE, pch = 1) The GAM model is much better at fitting these data because it can capture the nonlinear relationships between the variables. ----------------------------------------------------------------------------------------------------------- Parts of non-linear function GAMs are made up of basis functions that together compose the smooth terms in models. You can use the coef() function to extract the coefficients from the GAM model object. After extracting the coefficients, answer the question: How many coefficients are there for the basis functions that compose the smooth curve in this model? * Extract the model coefficients of the gam_mod object which is available in your workspace. # Extract the model coefficients coef(gam_mod) (Intercept) s(times).1 s(times).2 s(times).3 s(times).4 s(times).5 -25.545865 -63.718008 43.475644 -110.350132 -22.181006 35.034423 s(times).6 s(times).7 s(times).8 s(times).9 93.176458 -9.283018 -111.661472 17.603782 The smooth is made up of 9 basis functions, each with their own coefficient. A GAM with just two variables can have many coefficients, which means they require a bit more data than linear models. ----------------------------------------------------------------------------------------------------------- Setting complexity of the motorcycle model The number of basis functions in a smooth has a great impact on the shapes a model can take. Here, you'll practice modifying the number of basis functions in a model and examining the results. * Fit a GAM with 3 basis functions to the mcycle data, with accel as a smooth function of times. * Fit the same GAM again, but this time with 20 basis functions. * Use the provided plot() functions to visualize both models. library(mgcv) # Fit a GAM with 3 basis functions gam_mod_k3 <- gam(accel ~ s(times, k = 3), data = mcycle) # Fit with 20 basis functions gam_mod_k20 <- gam(accel ~ s(times, k = 20), data = mcycle) # Visualize the GAMs par(mfrow = c(1, 2)) plot(gam_mod_k3, residuals = TRUE, pch = 1) plot(gam_mod_k20, residuals = TRUE, pch = 1) See how the GAM with only three basis functions is much simpler, and doesn't model the complexity in the data. 20 basis functions, though, is enough to capture this relationship. ----------------------------------------------------------------------------------------------------------- Using smoothing parameters to avoid overfitting The smoothing parameter balances between likelihood and wiggliness to optimize model fit. Here, you'll examine smoothing parameters and will fit models with different fixed smoothing parameters. * View the value of the smoothing parameter (lambda) of the provided gam_mod model by extracting the sp value from the model. * Fit two models to the mcycle data with accel as a smooth function of times and a smoothing parameter of: 0.01 and 0.0001 * Visualize both models. library(mgcv) # Extract the smoothing parameter gam_mod <- gam(accel ~ s(times), data = mcycle, method = "REML") gam_mod$sp # Fix the smoothing parameter at 0.1 gam_mod_s1 <- gam(accel ~ s(times), data = mcycle, sp = 0.1) # Fix the smoothing parameter at 0.0001 gam_mod_s2 <- gam(accel ~ s(times), data = mcycle, sp = 0.0001) # Plot both models par(mfrow = c(2, 1)) plot(gam_mod_s1, residuals = TRUE, pch = 1) plot(gam_mod_s2, residuals = TRUE, pch = 1) See how the model with sp = 0.1 is much smoother, while the one with sp = 0.0001 fits most wiggles in the data. Note that these both have the same number of basis functions - the smoothing parameter sp calibrates the overall complexity with the bounds allowed by k. ----------------------------------------------------------------------------------------------------------- Complexity and smoothing together The number of basis functions and the smoothing parameters interact to control the wiggliness of a smooth function. Here you will see how changing both together affects model behavior. * Fit a GAM models to the mcycle data with accel as a smooth function of times with 50 basis functions and a smoothing parameter of 0.0001. * Visualize the model. library(mgcv) # Fit the GAM gam_mod_sk <- gam(accel ~ s(times, k = 50), sp = 0.0001, data = mcycle) #Visualize the model plot(gam_mod_sk, residuals = TRUE, pch = 1) Look how wiggly that model is! When k is high and sp is too low to smooth the model, it ends up overfitting the data. ----------------------------------------------------------------------------------------------------------- Multivariate GAMs of auto performance GAMs can accept multiple variables of different types. In the following exercises, you'll work with the mpg dataset available in the gamair package to practice fitting models of different forms. * Use the head() and str() functions to examine the mpg data set. * Fit a GAM to these data to predict city.mpg as the sum of smooth functions of weight, length, and price. * Use the plot() function provided to visualize the model. > glimpse(mpg) Rows: 205 Columns: 26 $ symbol 3, 3, 1, 2, 2, 2, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 1,… $ loss NA, NA, NA, 164, 164, NA, 158, NA, 158, NA, 192, 192, 188, … $ make alfa-romero, alfa-romero, alfa-romero, audi, audi, audi, au… $ fuel gas, gas, gas, gas, gas, gas, gas, gas, gas, gas, gas, gas,… $ aspir std, std, std, std, std, std, std, std, turbo, turbo, std, … $ doors two, two, two, four, four, two, four, four, four, two, two,… $ style convertible, convertible, hatchback, sedan, sedan, sedan, s… $ drive rwd, rwd, rwd, fwd, 4wd, fwd, fwd, fwd, fwd, 4wd, rwd, rwd,… $ eng.loc front, front, front, front, front, front, front, front, fro… $ wb 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 105.8, 99… $ length 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192.7, 192… $ width 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4, 67.9,… $ height 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9, 52.0,… $ weight 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086, 3053,… $ eng.type dohc, dohc, ohcv, ohc, ohc, ohc, ohc, ohc, ohc, ohc, ohc, o… $ cylinders four, four, six, four, five, five, five, five, five, five, … $ eng.cc 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 108, 108,… $ fuel.sys mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi, mpfi,… $ bore 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13, 3.13,… $ stroke 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40,… $ comp.ratio 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.30, 7.00… $ hp 111, 111, 154, 102, 115, 110, 110, 110, 140, 160, 101, 101,… $ rpm 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500, 5500,… $ city.mpg 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, 21, 20,… $ hw.mpg 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, 28, 25,… $ price 13495, 16500, 16500, 13950, 17450, 15250, 17710, 18920, 238… library(mgcv) # Examine the data head(mpg) str(mpg) # Fit the model mod_city <- gam(city.mpg ~ s(weight) + s(length) + s(price), data = mpg, method = "REML") # Plot the model plot(mod_city, pages = 1) Look at the plot for the price term - it's completely linear. Automatic smoothing found that a linear relationship was a better fit than anything wiggly. ----------------------------------------------------------------------------------------------------------- Adding categorical to the auto performance model Now you'll include categorical variables in your model. Categories are inherently linear, so you'll model them as linear terms. * Fit a GAM to the mpg data, modeling city.mpg as a sum of smooth functions of weight, length, and price, and also include the categorical terms fuel, drive, and style. * Use the plot() function provided to visualize the model. library(mgcv) # Fit the model mod_city2 <- gam(city.mpg ~ s(weight) + s(length) + s(price) + fuel + drive + style, data = mpg, method = "REML") # Plot the model plot(mod_city2, all.terms = TRUE, pages = 1) You've fit models with multiple smooth terms and models with smooth and categorical terms. Now let's fit a model with smooths that vary by category. ----------------------------------------------------------------------------------------------------------- Category-level smooths for different auto types Now you extend your models to include different smooths for different levels of categorical terms. * Fit a model to predict city fuel efficiency (city.mpg) with smooth terms of weight, length, and price, but make each of these smooth terms depend on the drive categorical variable using by= in the smooth terms. * Include a separate linear term for the drive variable. library(mgcv) # Fit the model mod_city3 <- gam(city.mpg ~ s(weight, by = drive) + s(length, by = drive) + s(price, by = drive) + drive, data = mpg, method = "REML") # Plot the model plot(mod_city3, pages = 1) You're already fitting some very complex models. In the next chapter, you'll learn about interpreting and visualizing your results.