Modeling soil pollution in the Netherlands Let's take a closer look at the meuse data and use it to fit your first 2-D model. Unless otherwise mentioned, assume that mgcv is loaded in your workspace going forward. After reviewing the model you will fit, answer the following question: How many basis functions are used in the two-dimensional smooth? * Inspect the meuse data with the head() and str() functions. * Fit a GAM model to the data that predicts the concentration of cadmium in the soil using an interaction of x and y coordinates. * Inspect your model with the summary() and coef() functions. head(meuse) x y cadmium copper lead zinc elev dist om ffreq soil lime 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6 1 1 1 2 181025 333558 8.6 81 277 1141 6.983 0.01222430 14.0 1 1 1 3 181165 333537 6.5 68 199 640 7.800 0.10302900 13.0 1 1 1 4 181298 333484 2.6 81 116 257 7.655 0.19009400 8.0 1 2 0 5 181307 333330 2.8 48 117 269 7.480 0.27709000 8.7 1 2 0 6 181390 333260 3.0 61 137 281 7.791 0.36406700 7.8 1 2 0 landuse dist.m 1 Ah 50 2 Ah 30 3 Ah 150 4 Ga 270 5 Ah 380 6 Ga 470 str(meuse) 'data.frame': 155 obs. of 14 variables: $ x : num 181072 181025 181165 181298 181307 ... $ y : num 333611 333558 333537 333484 333330 ... $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ... $ copper : num 85 81 68 81 48 61 31 29 37 24 ... $ lead : num 299 277 199 116 117 137 132 150 133 80 ... $ zinc : num 1022 1141 640 257 269 ... $ elev : num 7.91 6.98 7.8 7.66 7.48 ... $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ... $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ... $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ... $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ... $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ... $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ... $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ... # Fit the 2-D model mod2d <- gam(cadmium ~ s(x,y), data = meuse, method = "REML") # Inspect the model summary(mod2d) Family: gaussian Link function: identity Formula: cadmium ~ s(x, y) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.2458 0.1774 18.3 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(x,y) 23.48 27.24 8.667 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.607 Deviance explained = 66.7% -REML = 372.07 Scale est. = 4.8757 n = 155 coef(mod2d) (Intercept) s(x,y).1 s(x,y).2 s(x,y).3 s(x,y).4 s(x,y).5 3.2458065 0.8686658 -10.2154908 6.4161781 -2.6784725 9.1807111 s(x,y).6 s(x,y).7 s(x,y).8 s(x,y).9 s(x,y).10 s(x,y).11 3.7004932 -10.4780937 -8.9821840 -0.6218677 -4.6789789 -5.4267313 s(x,y).12 s(x,y).13 s(x,y).14 s(x,y).15 s(x,y).16 s(x,y).17 7.4996452 8.1962843 -7.6311640 4.5829340 -0.9734724 0.7634059 s(x,y).18 s(x,y).19 s(x,y).20 s(x,y).21 s(x,y).22 s(x,y).23 8.8112827 -4.8639552 -6.8085148 3.8059356 6.3499868 6.4701169 s(x,y).24 s(x,y).25 s(x,y).26 s(x,y).27 s(x,y).28 s(x,y).29 -8.1556061 7.2050985 0.1567317 -53.4384704 -4.2860149 5.5212533 There are 29 coefficients in s(x, y). Surfaces require more basis coefficients to construct than single variable smooths. That's why the default value for k is high for 2-D smooths. ----------------------------------------------------------------------------------------------------------- Adding more variables to predict soil pollution Now let's add additional predictors to the model with spatial interactions. Fit another model to predict cadmium in the soil, this time including smooths for the effect of elevation (elev) and distance from the river (dist) in addition to an x, y surface. # Fit the model mod2da <- gam(cadmium ~ s(x, y) + s(elev) + s(dist), data = meuse, method = "REML") # Inspect the model summary(mod2da) Family: gaussian Link function: identity Formula: cadmium ~ s(x, y) + s(elev) + s(dist) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.2458 0.1238 26.21 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(x,y) 20.398 24.599 2.324 0.000852 *** s(elev) 1.282 1.496 28.868 3.78e-07 *** s(dist) 6.609 7.698 13.677 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.809 Deviance explained = 84.4% -REML = 321.06 Scale est. = 2.3762 n = 155 Models of this form (s(x,y) + s(v1) + ...) are a great way to model spatial data because they incorporate spatial relationships as well as independent predictors. Now let's learn how to visualize them. ----------------------------------------------------------------------------------------------------------- Plotting the model surface Let's explore the different visualization schemes available in mgcv's plot() function. The model you built in the last exercise (mod2da) is available in your workspace. * Plot the interaction terms of mod2da as a contour plot. * Run the plot() function so interaction terms are displayed as 3D surfaces. * Run the plot() function so interaction terms are displayed as colored heat maps on a single page. # Contour plot plot(mod2da, pages = 1) # 3D surface plot plot(mod2da, scheme = 1, pages = 1) # Colored heat map plot(mod2da, scheme = 2, pages = 1) These schemes give you some quick options for plotting interactions as part of models. Now, let's customize some 2-D plots. ----------------------------------------------------------------------------------------------------------- Customizing 3D plots Uncertainty is easy to see in plots of univariate smooths, but more challenging to represent in 2D relationships. Here we'll visualize uncertainty in a geospatial interaction. The model (mod2d) from exercise 2 is available in your workspace. * Use vis.gam() to make a 3D perspective plot of the x, y relationship in the model, using the se argument to make confidence interval surfaces at +/- 2 standard errors. * Make another version of the same plot, rotated 135 degrees to view it from another angle. # Make the perspective plot with error surfaces vis.gam(mod2d, view = c("x", "y"), plot.type = "persp", se = 2) # Rotate the same plot vis.gam(mod2d, view = c("x","y"), plot.type = "persp", se = 2, theta = 135) ----------------------------------------------------------------------------------------------------------- Extrapolation in surface plots When making predictions from the models, it is important to understand how far from the range of your data you are extrapolating. With multivariate smooths, the shape of the areas supported by data may be complex. Here you'll make plots that compare extrapolations to support in the data. The model (mod2d) from exercise 2 is available in your workspace. * Make a contour plot of the model using vis.gam(), extrapolating out from the data 5%. * Make a contour plot of the model using vis.gam(), extrapolating out from the data 10%. * Make a contour plot of the model using vis.gam(), extrapolating out from the data 25%. Overlay the meuse data on top of your visualization as points. # Make plot with 5% extrapolation vis.gam(mod2d, view = c("x", "y"), plot.type = "contour", too.far = 0.05) # Overlay data points(meuse) # Make plot with 10% extrapolation vis.gam(mod2d, view = c("x", "y"), plot.type = "contour", too.far = 0.1) # Overlay data points(meuse) # Make plot with 25% extrapolation vis.gam(mod2d, view = c("x", "y"), plot.type = "contour", too.far = 0.25) # Overlay data points(meuse) You can see from the progression of plots how extrapolation changes the impression of the extent of model support. When presenting such model results, it's always important to be aware of the ranges of data supporting the model. ----------------------------------------------------------------------------------------------------------- Soil pollution in different land uses The meuse data set has a factor variable, landuse, which specifies the type of land use or cover at the location where soil was sampled. * Using a categorical-continuous interaction (e.g., the by = form), fit a model to the meuse data that predicts copper levels as a function of dist, with different smooths for each level of landuse. * Include a separate term for varying intercepts for each level of landuse. * Print the model summary. * Fit a model with a factor-smooth interaction between dist and landuse variables using the bs = "fs" formulation. * Print the model summary. # Fit a model with separate smooths for each land-use level mod_sep <- gam(copper ~ s(dist, by = landuse) + landuse, data = meuse, method = "REML") # Examine the summary summary(mod_sep) Family: gaussian Link function: identity Formula: copper ~ s(dist, by = landuse) + landuse Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.115 46.335 0.434 0.665 landuseAb 15.674 46.607 0.336 0.737 landuseAg 6.670 46.825 0.142 0.887 landuseAh 18.137 46.395 0.391 0.697 landuseAm 13.364 46.438 0.288 0.774 landuseB 0.825 62.979 0.013 0.990 landuseBw 18.909 46.755 0.404 0.687 landuseDEN 0.000 0.000 NaN NaN landuseFh 4.885 155.449 0.031 0.975 landuseFw 18.578 46.571 0.399 0.691 landuseGa 122.541 99.225 1.235 0.219 landuseSPO 1.885 146.511 0.013 0.990 landuseSTA 3.261 49.282 0.066 0.947 landuseTv 4.885 134.319 0.036 0.971 landuseW 18.917 46.441 0.407 0.684 Approximate significance of smooth terms: edf Ref.df F p-value s(dist):landuseAa 1.000e+00 1.000e+00 0.000 0.98367 s(dist):landuseAb 1.000e+00 1.000e+00 1.672 0.19845 s(dist):landuseAg 1.000e+00 1.000e+00 0.442 0.50765 s(dist):landuseAh 2.524e+00 3.113e+00 8.725 2.28e-05 *** s(dist):landuseAm 1.000e+00 1.000e+00 6.802 0.01025 * s(dist):landuseB 1.000e+00 1.000e+00 0.023 0.87892 s(dist):landuseBw 1.000e+00 1.000e+00 0.047 0.82898 s(dist):landuseDEN 1.000e+00 1.000e+00 0.042 0.83865 s(dist):landuseFh 6.401e-16 6.401e-16 0.000 1.00000 s(dist):landuseFw 2.696e+00 3.304e+00 5.292 0.00134 ** s(dist):landuseGa 1.882e+00 1.986e+00 0.654 0.49571 s(dist):landuseSPO -9.575e-16 -9.575e-16 0.000 1.00000 s(dist):landuseSTA 1.000e+00 1.000e+00 0.005 0.94455 s(dist):landuseTv 7.553e-16 7.553e-16 0.000 1.00000 s(dist):landuseW 4.009e+00 4.814e+00 32.861 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Rank: 146/150 R-sq.(adj) = 0.634 Deviance explained = 71.1% -REML = 532.81 Scale est. = 199.48 n = 154 # Fit a model with a factor-smooth interaction mod_fs <- gam(copper ~ s(dist, landuse, bs = "fs"), data = meuse, method = "REML") # Examine the summary summary(mod_fs) Family: gaussian Link function: identity Formula: copper ~ s(dist, landuse, bs = "fs") Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 38.168 2.252 16.95 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(dist,landuse) 21.96 72 3.283 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.607 Deviance explained = 66.3% -REML = 650.98 Scale est. = 214.05 n = 154 ----------------------------------------------------------------------------------------------------------- Plotting pollution in different land uses You can observe the differences between different continuous-categorical interaction types by visualizing them. Here, you'll look at the different ways the by-type and factor-smooth-type interactions are plotted, and see how the approaches fit the models differently. Both the models (mod_sep and mod_fs) from the previous exercise are available in your workspace. * Plot both models using the plot() function, using the pages argument to keep all terms on one plot. * Plot both models making 3D perspective plots with vis.gam(). # Plot both the models with plot() plot(mod_sep) plot(mod_fs) # Plot both the models with vis.gam() vis.gam(mod_sep, view = c("dist", "landuse"), plot.type = "persp") vis.gam(mod_fs, view = c("dist", "landuse"), plot.type = "persp") Flip back and forth between the last two plots you made. Note the fs model has fewer wild swings - it is more stable for models where some categories have few data points, though we don't get statistics on the individual smooths in this case. ----------------------------------------------------------------------------------------------------------- Pollution models with multi-scale interactions The meuse dataset contains some predictor variables that are on the same scale (x, y), and some that are on different scales (elev, dist, om). In a previous exercise, you fit a model where you predicted cadmium pollution as a function of location and elevation: mod <- gam(cadmium ~ s(x, y) + s(elev), data = meuse, method = "REML") In this exercise, you'll build a model that allows multiple variables to interact despite these different scales using a tensor smooth, te(). * Convert this to a model where x, y, and elev all interact in a single te() term, varying on their own scales. * Then summarize the model and visualize it with plot(). # Fit the model tensor_mod <- gam(cadmium ~ te(x, y, elev), data = meuse, method = "REML") # Summarize and plot summary(tensor_mod) plot(tensor_mod) Family: gaussian Link function: identity Formula: cadmium ~ te(x, y, elev) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.2458 0.1329 24.43 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value te(x,y,elev) 38.29 45.86 11.87 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.78 Deviance explained = 83.4% -REML = 318.09 Scale est. = 2.7358 n = 155 Look at the output plot for your 3-variable tensor model. To visulize a 3-variable interaction, mgcv plots 2D “slices” of the 2-way interaction at different levels of the third variable. ----------------------------------------------------------------------------------------------------------- Teasing apart multi-scale interactions In the previous exercise, you fit the following model: tensor_mod <- gam(cadmium ~ te(x, y, elev), data = meuse, method = "REML") In this exercise, you will fit a model with smooths and tensor interactions to separate out the independent and interacting effects of variables. * Convert the above model such that x and y interact on the same scale, the effect elev is a separate smooth, and the interaction of all three on different scales is a separate term. * Summarize and plot the model. # Fit the model tensor_mod2 <- gam(cadmium ~ s(x,y) + s(elev) + ti(x, y, elev), data = meuse, method = "REML") # Summarize and plot summary(tensor_mod2) plot(tensor_mod2, pages = 1) Family: gaussian Link function: identity Formula: cadmium ~ s(x, y) + s(elev) + ti(x, y, elev) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.7044 0.2244 12.05 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(x,y) 21.812 25.491 6.386 < 2e-16 *** s(elev) 3.898 4.688 9.680 5.79e-07 *** ti(x,y,elev) 14.656 19.180 2.706 0.000569 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.793 Deviance explained = 84.7% -REML = 336.62 Scale est. = 2.5755 n = 155 Look at your plots. You can see how horizontal space elevation have independent effects, but we still capture the interaction of all the terms.