Modeling soil pollution in the Netherlands

Let's take a closer look at the meuse data and use it to fit your first 2-D model.

Unless otherwise mentioned, assume that mgcv is loaded in your workspace going forward.

After reviewing the model you will fit, answer the following question: How many basis functions 
are used in the two-dimensional smooth?

* Inspect the meuse data with the head() and str() functions.
* Fit a GAM model to the data that predicts the concentration of cadmium in the soil using 
  an interaction of x and y coordinates.
* Inspect your model with the summary() and coef() functions.

 
head(meuse)
       x      y cadmium copper lead zinc  elev       dist   om ffreq soil lime
1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1    1
2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1    1
3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1    1
4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2    0
5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2    0
6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2    0
  landuse dist.m
1      Ah     50
2      Ah     30
3      Ah    150
4      Ga    270
5      Ah    380
6      Ga    470

str(meuse)
'data.frame':	155 obs. of  14 variables:
 $ x      : num  181072 181025 181165 181298 181307 ...
 $ y      : num  333611 333558 333537 333484 333330 ...
 $ cadmium: num  11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
 $ copper : num  85 81 68 81 48 61 31 29 37 24 ...
 $ lead   : num  299 277 199 116 117 137 132 150 133 80 ...
 $ zinc   : num  1022 1141 640 257 269 ...
 $ elev   : num  7.91 6.98 7.8 7.66 7.48 ...
 $ dist   : num  0.00136 0.01222 0.10303 0.19009 0.27709 ...
 $ om     : num  13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
 $ ffreq  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ soil   : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
 $ lime   : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
 $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
 $ dist.m : num  50 30 150 270 380 470 240 120 240 420 ...
 
# Fit the 2-D model
mod2d <- gam(cadmium ~ s(x,y), data = meuse, method = "REML")

# Inspect the model

summary(mod2d)


Family: gaussian 
Link function: identity 

Formula:
cadmium ~ s(x, y)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2458     0.1774    18.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df     F p-value    
s(x,y) 23.48  27.24 8.667  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.607   Deviance explained = 66.7%
-REML = 372.07  Scale est. = 4.8757    n = 155


coef(mod2d)

(Intercept)    s(x,y).1    s(x,y).2    s(x,y).3    s(x,y).4    s(x,y).5 
  3.2458065   0.8686658 -10.2154908   6.4161781  -2.6784725   9.1807111 
   s(x,y).6    s(x,y).7    s(x,y).8    s(x,y).9   s(x,y).10   s(x,y).11 
  3.7004932 -10.4780937  -8.9821840  -0.6218677  -4.6789789  -5.4267313 
  s(x,y).12   s(x,y).13   s(x,y).14   s(x,y).15   s(x,y).16   s(x,y).17 
  7.4996452   8.1962843  -7.6311640   4.5829340  -0.9734724   0.7634059 
  s(x,y).18   s(x,y).19   s(x,y).20   s(x,y).21   s(x,y).22   s(x,y).23 
  8.8112827  -4.8639552  -6.8085148   3.8059356   6.3499868   6.4701169 
  s(x,y).24   s(x,y).25   s(x,y).26   s(x,y).27   s(x,y).28   s(x,y).29 
 -8.1556061   7.2050985   0.1567317 -53.4384704  -4.2860149   5.5212533 

There are 29 coefficients in s(x, y). Surfaces require more basis coefficients to construct than single
variable smooths. That's why the default value for k is high for 2-D smooths.  

-----------------------------------------------------------------------------------------------------------

Adding more variables to predict soil pollution

Now let's add additional predictors to the model with spatial interactions.

Fit another model to predict cadmium in the soil, this time including smooths for the
effect of elevation (elev) and distance from the river (dist) in addition to an x, y surface.


# Fit the model
mod2da <- gam(cadmium ~ s(x, y) + s(elev) + s(dist), 
              data = meuse, method = "REML")

# Inspect the model
summary(mod2da)


Family: gaussian 
Link function: identity 

Formula:
cadmium ~ s(x, y) + s(elev) + s(dist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2458     0.1238   26.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df      F  p-value    
s(x,y)  20.398 24.599  2.324 0.000852 ***
s(elev)  1.282  1.496 28.868 3.78e-07 ***
s(dist)  6.609  7.698 13.677  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.809   Deviance explained = 84.4%
-REML = 321.06  Scale est. = 2.3762    n = 155


Models of this form (s(x,y) + s(v1) + ...) are a great way to model spatial data because they incorporate
spatial relationships as well as independent predictors. Now let's learn how to visualize them. 

-----------------------------------------------------------------------------------------------------------

Plotting the model surface

Let's explore the different visualization schemes available in mgcv's plot() function.

The model you built in the last exercise (mod2da) is available in your workspace.

* Plot the interaction terms of mod2da as a contour plot.
* Run the plot() function so interaction terms are displayed as 3D surfaces.
* Run the plot() function so interaction terms are displayed as colored heat maps on a single page.

# Contour plot
plot(mod2da, pages = 1)

# 3D surface plot
plot(mod2da, scheme = 1, pages = 1)

# Colored heat map
plot(mod2da, scheme = 2, pages = 1)

These schemes give you some quick options for plotting interactions as part of models. 
Now, let's customize some 2-D plots. 

-----------------------------------------------------------------------------------------------------------

Customizing 3D plots

Uncertainty is easy to see in plots of univariate smooths, but more challenging to represent in 2D
relationships. Here we'll visualize uncertainty in a geospatial interaction.

The model (mod2d) from exercise 2 is available in your workspace.

* Use vis.gam() to make a 3D perspective plot of the x, y relationship in the model, using 
  the se argument to make confidence interval surfaces at +/- 2 standard errors.
* Make another version of the same plot, rotated 135 degrees to view it from another angle.

# Make the perspective plot with error surfaces
vis.gam(mod2d, view = c("x", "y"), 
        plot.type = "persp", se = 2)

# Rotate the same plot
vis.gam(mod2d, view = c("x","y"),
        plot.type = "persp", se = 2,
        theta = 135)

-----------------------------------------------------------------------------------------------------------

Extrapolation in surface plots

When making predictions from the models, it is important to understand how far from the range of your data
you are extrapolating. With multivariate smooths, the shape of the areas supported by data may be complex.
Here you'll make plots that compare extrapolations to support in the data.

The model (mod2d) from exercise 2 is available in your workspace.

* Make a contour plot of the model using vis.gam(), extrapolating out from the data 5%.
* Make a contour plot of the model using vis.gam(), extrapolating out from the data 10%.
* Make a contour plot of the model using vis.gam(), extrapolating out from the data 25%.
  Overlay the meuse data on top of your visualization as points.
  

# Make plot with 5% extrapolation
vis.gam(mod2d, view = c("x", "y"), 
        plot.type = "contour", too.far = 0.05)

# Overlay data
points(meuse)

# Make plot with 10% extrapolation
vis.gam(mod2d, view = c("x", "y"), 
        plot.type = "contour", too.far = 0.1)

# Overlay data
points(meuse)

# Make plot with 25% extrapolation
vis.gam(mod2d, view = c("x", "y"), 
        plot.type = "contour", too.far = 0.25)

# Overlay data
points(meuse)

You can see from the progression of plots how extrapolation changes the impression of the extent of model
support. When presenting such model results, it's always important to be aware of the ranges of data
supporting the model.

-----------------------------------------------------------------------------------------------------------

Soil pollution in different land uses

The meuse data set has a factor variable, landuse, which specifies the type of land use or cover at the
location where soil was sampled.

* Using a categorical-continuous interaction (e.g., the by = form), fit a model to the meuse data that
  predicts copper levels as a function of dist, with different smooths for each level of landuse.
* Include a separate term for varying intercepts for each level of landuse.
* Print the model summary.

* Fit a model with a factor-smooth interaction between dist and landuse variables using the 
  bs = "fs" formulation.
* Print the model summary.


# Fit a model with separate smooths for each land-use level
mod_sep <- gam(copper ~ s(dist, by = landuse) + landuse,
               data = meuse, method = "REML")

# Examine the summary
summary(mod_sep)


Family: gaussian 
Link function: identity 

Formula:
copper ~ s(dist, by = landuse) + landuse

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   20.115     46.335   0.434    0.665
landuseAb     15.674     46.607   0.336    0.737
landuseAg      6.670     46.825   0.142    0.887
landuseAh     18.137     46.395   0.391    0.697
landuseAm     13.364     46.438   0.288    0.774
landuseB       0.825     62.979   0.013    0.990
landuseBw     18.909     46.755   0.404    0.687
landuseDEN     0.000      0.000     NaN      NaN
landuseFh      4.885    155.449   0.031    0.975
landuseFw     18.578     46.571   0.399    0.691
landuseGa    122.541     99.225   1.235    0.219
landuseSPO     1.885    146.511   0.013    0.990
landuseSTA     3.261     49.282   0.066    0.947
landuseTv      4.885    134.319   0.036    0.971
landuseW      18.917     46.441   0.407    0.684

Approximate significance of smooth terms:
                          edf     Ref.df      F  p-value    
s(dist):landuseAa   1.000e+00  1.000e+00  0.000  0.98367    
s(dist):landuseAb   1.000e+00  1.000e+00  1.672  0.19845    
s(dist):landuseAg   1.000e+00  1.000e+00  0.442  0.50765    
s(dist):landuseAh   2.524e+00  3.113e+00  8.725 2.28e-05 ***
s(dist):landuseAm   1.000e+00  1.000e+00  6.802  0.01025 *  
s(dist):landuseB    1.000e+00  1.000e+00  0.023  0.87892    
s(dist):landuseBw   1.000e+00  1.000e+00  0.047  0.82898    
s(dist):landuseDEN  1.000e+00  1.000e+00  0.042  0.83865    
s(dist):landuseFh   6.401e-16  6.401e-16  0.000  1.00000    
s(dist):landuseFw   2.696e+00  3.304e+00  5.292  0.00134 ** 
s(dist):landuseGa   1.882e+00  1.986e+00  0.654  0.49571    
s(dist):landuseSPO -9.575e-16 -9.575e-16  0.000  1.00000    
s(dist):landuseSTA  1.000e+00  1.000e+00  0.005  0.94455    
s(dist):landuseTv   7.553e-16  7.553e-16  0.000  1.00000    
s(dist):landuseW    4.009e+00  4.814e+00 32.861  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Rank: 146/150
R-sq.(adj) =  0.634   Deviance explained = 71.1%
-REML = 532.81  Scale est. = 199.48    n = 154


# Fit a model with a factor-smooth interaction
mod_fs <- gam(copper ~ s(dist, landuse, bs = "fs"),
              data = meuse, method = "REML")

# Examine the summary
summary(mod_fs)


Family: gaussian 
Link function: identity 

Formula:
copper ~ s(dist, landuse, bs = "fs")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   38.168      2.252   16.95   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                  edf Ref.df     F p-value    
s(dist,landuse) 21.96     72 3.283  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.607   Deviance explained = 66.3%
-REML = 650.98  Scale est. = 214.05    n = 154

-----------------------------------------------------------------------------------------------------------

Plotting pollution in different land uses

You can observe the differences between different continuous-categorical interaction types by visualizing
them. Here, you'll look at the different ways the by-type and factor-smooth-type interactions are plotted,
and see how the approaches fit the models differently.

Both the models (mod_sep and mod_fs) from the previous exercise are available in your workspace.

* Plot both models using the plot() function, using the pages argument to keep all terms on one plot.
* Plot both models making 3D perspective plots with vis.gam().

# Plot both the models with plot()
plot(mod_sep)
plot(mod_fs)

# Plot both the models with vis.gam()
vis.gam(mod_sep, view = c("dist", "landuse"), plot.type = "persp")
vis.gam(mod_fs, view = c("dist", "landuse"), plot.type = "persp")

Flip back and forth between the last two plots you made. Note the fs model has fewer wild swings - it is
more stable for models where some categories have few data points, though we don't get statistics on the
individual smooths in this case.

-----------------------------------------------------------------------------------------------------------

Pollution models with multi-scale interactions

The meuse dataset contains some predictor variables that are on the same scale (x, y), and some that are
on different scales (elev, dist, om). In a previous exercise, you fit a model where you predicted cadmium
pollution as a function of location and elevation:

mod <- gam(cadmium ~ s(x, y) + s(elev), 
           data = meuse, method = "REML")

In this exercise, you'll build a model that allows multiple variables to interact despite these different
scales using a tensor smooth, te().

* Convert this to a model where x, y, and elev all interact in a single te() term, varying on their own
  scales.
* Then summarize the model and visualize it with plot().


# Fit the model
tensor_mod <- gam(cadmium ~ te(x, y, elev), data = meuse, method = "REML")

# Summarize and plot
summary(tensor_mod)
plot(tensor_mod)


Family: gaussian 
Link function: identity 

Formula:
cadmium ~ te(x, y, elev)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2458     0.1329   24.43   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
te(x,y,elev) 38.29  45.86 11.87  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.78   Deviance explained = 83.4%
-REML = 318.09  Scale est. = 2.7358    n = 155

Look at the output plot for your 3-variable tensor model. To visulize a 3-variable interaction, 
mgcv plots 2D “slices” of the 2-way interaction at different levels of the third variable. 

-----------------------------------------------------------------------------------------------------------

Teasing apart multi-scale interactions

In the previous exercise, you fit the following model:

tensor_mod <- gam(cadmium ~ te(x, y, elev), 
                  data = meuse, method = "REML")

In this exercise, you will fit a model with smooths and tensor interactions to separate out the 
independent and interacting effects of variables.

* Convert the above model such that x and y interact on the same scale, the effect elev is a separate
  smooth, and the interaction of all three on different scales is a separate term.
* Summarize and plot the model.


# Fit the model
tensor_mod2 <- gam(cadmium ~ s(x,y) + s(elev) + ti(x, y, elev),
                   data = meuse, method = "REML")

# Summarize and plot
summary(tensor_mod2)
plot(tensor_mod2, pages = 1)


Family: gaussian 
Link function: identity 

Formula:
cadmium ~ s(x, y) + s(elev) + ti(x, y, elev)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.7044     0.2244   12.05   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                edf Ref.df     F  p-value    
s(x,y)       21.812 25.491 6.386  < 2e-16 ***
s(elev)       3.898  4.688 9.680 5.79e-07 ***
ti(x,y,elev) 14.656 19.180 2.706 0.000569 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.793   Deviance explained = 84.7%
-REML = 336.62  Scale est. = 2.5755    n = 155

Look at your plots. You can see how horizontal space elevation have independent effects, but we still
capture the interaction of all the terms.