Introduction

spmodel is an package used to fit, summarize, and predict for a variety of spatial statistical models. The vignette provides an introduction spatial generalized linear models for non-Gaussian response distributions in spmodel. Before proceeding, we load spmodel by running

If using spmodel in a formal publication or report, please cite it. Citing spmodel lets us devote more resources to the package in the future. We view the spmodel citation by running

citation(package = "spmodel")
#> To cite spmodel in publications use:
#> 
#>   Dumelle M, Higham M, Ver Hoef JM (2023). spmodel: Spatial statistical
#>   modeling and prediction in R. PLOS ONE 18(3): e0282524.
#>   https://doi.org/10.1371/journal.pone.0282524
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {{spmodel}: Spatial statistical modeling and prediction in {R}},
#>     author = {Michael Dumelle and Matt Higham and Jay M. {Ver Hoef}},
#>     journal = {PLOS ONE},
#>     year = {2023},
#>     volume = {18},
#>     number = {3},
#>     pages = {1--32},
#>     doi = {10.1371/journal.pone.0282524},
#>     url = {https://doi.org/10.1371/journal.pone.0282524},
#>   }

We will create visualizations using ggplot2 (Wickham 2016), which we load by running

ggplot2 is only installed alongside spmodel when dependencies = TRUE in install.packages(), so check that the package is installed and loaded before reproducing any of these vignette’s visualizations. We will also show code that can be used to create interactive visualizations of spatial data with mapview (Appelhans et al. 2022). mapview also has many backgrounds available that contextualize spatial data with topographical information. Before running the mapview code interactively, make sure mapview is installed and loaded.

spmodel contains various methods for generic functions defined outside of spmodel. To find relevant documentation for these methods, run help("generic.spmodel", "spmodel") (e.g., help("summary.spmodel", "spmodel"), help("predict.spmodel", "spmodel"), etc.). Note that ?generic.spmodel is shorthand for help("generic.spmodel", "spmodel"). We provide more details and examples regarding these methods and generics throughout this vignette. For a full list of spmodel functions available, see spmodel’s documentation manual.

A Review of the Spatial Linear Model

We first focus on spatial linear models for a Gaussian response vector. The spatial linear model is defined as \[\begin{equation}\label{eq:splm} \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \end{equation}\] where for a sample size \(n\), \(\mathbf{y}\) is an \(n \times 1\) column vector of response variables, \(\mathbf{X}\) is an \(n \times p\) design (model) matrix of explanatory variables, \(\boldsymbol{\beta}\) is a \(p \times 1\) column vector of fixed effects controlling the impact of \(\mathbf{X}\) on \(\mathbf{y}\), \(\boldsymbol{\tau}\) is an \(n \times 1\) column vector of spatially dependent random errors, and \(\boldsymbol{\epsilon}\) is an \(n \times 1\) column vector of spatially independent random errors. We make a few assumptions about \(\boldsymbol{\tau}\) and \(\boldsymbol{\epsilon}\) in the spatial linear model: first, that \(\text{E}(\boldsymbol{\tau}) = \text{E}(\boldsymbol{\epsilon}) = \boldsymbol{0}\), where \(\text{E}(\cdot)\) denotes expectation; second, that \(\text{Cov}(\boldsymbol{\tau}) = \sigma^2_\tau \mathbf{R}\), where \(\mathbf{R}\) is an \(n \times n\) matrix that determines the spatial dependence structure in \(\mathbf{y}\) and depends on a range parameter, \(\phi\); third, that \(\text{Cov}(\boldsymbol{\epsilon}) = \sigma^2_\epsilon \mathbf{I}\), where \(\mathbf{I}\) is an \(n \times n\) identity matrix; and fourth, that \(\boldsymbol{\tau}\) and \(\boldsymbol{\epsilon}\) are independent of one another. The parameter \(\sigma^2_{\tau}\) is called the spatially dependent random error variance or partial sill. The parameter \(\sigma^2_\epsilon\) is called the spatially independent random error variance or nugget. These two variance parameters are henceforth more intuitively written as \(\sigma^2_{de}\) and \(\sigma^2_{ie}\), respectively. The covariance of \(\mathbf{y}\) is denoted \(\boldsymbol{\Sigma}\) and given by \[\begin{equation}\label{eq:spcov} \boldsymbol{\Sigma} = \sigma^2_{de}\mathbf{R} + \sigma^2_{ie} \mathbf{I}. \end{equation}\] The parameters \(\sigma^2_{de}\), \(\phi\), and \(\sigma^2_{ie}\) are elements of \(\boldsymbol{\theta}\), the covariance parameter vector.

The spatial linear model applies to both point-referenced and areal (i.e., lattice; polygon) data. Spatial data are point-referenced when the elements in \(\mathbf{y}\) are observed at point-locations indexed by x-coordinates and y-coordinates on a spatially continuous surface with an infinite number of locations. For example, consider sampling soil at any point-location in a field. Spatial linear models for point-referenced data are sometimes called geostatistical models. An example of an \(\mathbf{R}\) matrix for point-referenced data is the spherical correlation matrix given by \[\begin{equation}\label{eq:spherical} \mathbf{R} = \left(1 - 1.5\frac{\mathbf{H}}{\phi} + 0.5\frac{\mathbf{H}^3}{\phi^3}\right) \odot \mathcal{I}(\mathbf{H} \leq \phi), \end{equation}\]
where \(\mathbf{H}\) is a matrix of Euclidean distances among observations, \(\mathbf{H}^3\) is a matrix of cubed Euclidean distances, \(\odot\) is the Hadmard (element-wise; direct) product, and \(\mathcal{I}(\mathbf{H} \leq \phi)\) is an indicator function equal to one when the \(ij\)th element of \(\mathbf{H}\) is less than \(\phi\) (the range) and zero otherwise. A second is example is the Matérn correlation matrix given by \[\begin{equation}\label{eq:matern} \mathbf{R} = \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(2\nu \mathbf{H} / \phi \right)^{\nu / 2} \odot B_k(2\nu \mathbf{H}, \nu), \end{equation}\] where \(\nu\) is a smoothness parameter, \(\odot\) is the Hadmart (element-wise; direct) product, and \(B_k(\cdot, \nu)\) is a Bessel-K function with order \(\nu\). Chiles and Delfiner (2012) provide examples of several spatial dependence structures.

Spatial data are areal when they are observed as part of a finite network of polygons whose connections are indexed by a neighborhood structure. For example, the polygons may represent counties in a state who are neighbors if they share at least one boundary. An example of an \(\mathbf{R}\) matrix for areal data is the simultaneous-autoregressive spatial correlation matrix given by \[\begin{equation*} \mathbf{R} = [(\mathbf{I} - \phi \mathbf{W})(\mathbf{I} - \phi \mathbf{W})^\top]^{-1}, \end{equation*}\] where \(\mathbf{W}\) is a weight matrix that describes the neighborhood structure among observations. Spatial linear models for areal data are sometimes called spatial autoregressive models. For thorough reviews of the spatial linear model, see Cressie (1993), Banerjee, Carlin, and Gelfand (2014), and Schabenberger and Gotway (2017).

The Spatial Generalized Linear Model

It is common to implicitly assume \(\mathbf{y}\) in the spatial linear model is unconstrained and Gaussian. This is a restrictive assumption and often inappropriate for binary data, proportion data, count data, and skewed data, which are common in many practical applications. These types of data are most naturally described using generalized linear models (Nelder and Wedderburn 1972; McCullagh and Nelder 1989). Generalized linear models are an extension of linear models that “link” a function of a distribution’s mean to the linear function \(\mathbf{X} \boldsymbol{\beta}\) (the explanatory variables and fixed effects). We can extend generalized linear models by adding linear, latent (i.e., unobserved) random effects to \(\mathbf{X} \boldsymbol{\beta}\). These types of models are called generalized linear mixed models (Breslow and Clayton 1993) and are often formulated using a hierarchical construction (Lee and Nelder 1996). For thorough reviews of generalized linear models and generalized linear mixed models, see Bolker et al. (2009), Myers et al. (2012), Faraway (2016), and Wood (2017).

A recent spmodel release (version 0.4.0) provided support for spatial generalized linear models (i.e., a generalized linear mixed model with spatial random effects). We define the spatial generalized linear model as \[\begin{equation}\label{eq:spglm} f(\boldsymbol{\mu}) \equiv \mathbf{w} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \end{equation}\] where \(f(\cdot)\) is the link function, \(\boldsymbol{\mu}\) is the mean of \(\mathbf{y}\), and the remaining terms \(\mathbf{X}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\tau}\), \(\boldsymbol{\epsilon}\) represent the same quantities as for spatial linear models. That is, \(\text{E}(\boldsymbol{\tau}) = \text{E}(\boldsymbol{\epsilon}) = \boldsymbol{0}\), \(\text{Cov}(\boldsymbol{\tau}) = \sigma^2_{de}\mathbf{R}\), \(\text{Cov}(\boldsymbol{\epsilon}) = \sigma^2_{ie} \mathbf{I}\), \(\boldsymbol{\tau}\) is independent of \(\boldsymbol{\epsilon}\), and \(\boldsymbol{\Sigma} = \sigma^2_{de} \mathbf{R} + \sigma^2_{ie} \mathbf{I}\). The link function, \(f(\cdot)\), “links” a function of \(\boldsymbol{\mu}\) to the linear term \(\mathbf{w}\). Note that the linking of \(\boldsymbol{\mu}\) to \(\mathbf{w}\) applies element-wise to each vector. Each link function \(f(\cdot)\) has a corresponding inverse link function, \(f^{-1}(\cdot)\), which “links” a function of \(\mathbf{w}\) to \(\boldsymbol{\mu}\). Notice that \(\mathbf{w}\) is unconstrained but \(\boldsymbol{\mu}\) is usually constrained in some way (e.g., positive) that depends on the distribution assumed for \(\mathbf{y}\).

Spatial linear and generalized linear models rely on the same components: explanatory variables, fixed effects, spatially dependent random errors, and spatial independent random errors. These components are related to the response linearly (spatial linear models) or non-linearly (spatial generalized linear models). The response itself is generally assumed to follow a Gaussian distribution (spatial linear models) or one of many non-Gaussian distributions (spatial generalized linear models). Some non-Gaussian distributions include the binomial distribution for binary responses, the beta distribution for proportion responses, the Poisson or negative binomial distributions for count responses, and the gamma or inverse Gaussian distributions for skewed, positive continuous responses.

spmodel implements a marginal approach to inference that uses a novel application of the Laplace approximation. This marginal approach is powerful and flexible, accommodating a wide range of covariance structures. It is a frequentist approach, which means that it does not rely on any Bayesian sampling routines, as is common when handling complex hierarchical structures. It also formally maximizes a likelihood, which is advantageous because it means that likelihood-based statistics such as AIC (Akaike 1974; Hoeting et al. 2006) and BIC (Schwarz 1978) are defined, in contrast to approaches that only specify the first two moments of a distribution (e.g., quasi-likelihood fit via iteratively re-weighted least squares). Ver Hoef et al. (2024) provide the methodology’s full details.

spmodel accommodates the binomial, beta, Poisson, negative binomial, gamma, and inverse Gaussian distributions. The binomial and beta response distributions (specified by ) use the logit link, while the Poisson, negative binomial, gamma, and inverse Gaussian distributions use the log link – see the following table.

Spatial generalized linear model response distributions available in spmodel alongside their corresponding link functions and data types.
Family Link Function Link Name Data Type
Binomial \(f(\mathbf{\mu}) = \log(\mathbf{\mu} / (1 - \mathbf{\mu}))\) Logit Binary; Binary Count
Poisson \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) Log Count
Negative Binomial \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) Log Count
Beta \(f(\mathbf{\mu}) = \log(\mathbf{\mu} / (1 - \mathbf{\mu}))\) Logit Proportion
Gamma \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) Log Skewed
Inverse Gaussian \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) Log Skewed

Model fitting

Spatial generalized linear models in spmodel are fit using (for point-referenced data) or (for areal data). These functions are similar in structure both to one another and to the function in base R. They generally require the following four arguments: , a formula that describes the relationship between the response and explanatory variables; , the response distribution, which can be , , , , , or ; , a or object (Pebesma 2018) whose rows index observations and whose columns contain the response variable, explanatory variables, and spatial information; and , one of 19 spatial covariance types (e.g., spherical spatial covariance, Matérn spatial covariance, simultaneous-autoregressive spatial covariance etc.). The binomial and beta response distributions (specified by ) use the logit link, while the Poisson, negative binomial, gamma, and inverse Gaussian distributions use the log link. Other arguments to and are either required only in specific situations or have default values specified. All arguments to the and functions are summarized in the following table. Models for several hundred observations can take anywhere from a few seconds to a few minutes to fit, depending on the shape of the likelihood surface and how quickly the covariance parameters converge. Models for several thousand observations can utilize spmodel’s large data set options.

Arguments to spglm() and spgautor(), their purpose, and whether the argument applies to both spglm() and spgautor() or just one of them.
Argument Purposes Applies To
\(\texttt{formula}\) Fixed effects formula Both
\(\texttt{family}\) Response distribution Both
\(\texttt{data}\) Data Frame or object Both
\(\texttt{spcov_type}\) Spatial covariance type Both
\(\texttt{spcov_initial}\) Initial or known spatial covariance parameters Both
\(\texttt{dispersion_initial}\) Initial or known dispersion parameter Both
\(\texttt{estmethod}\) Estimation method Both
\(\texttt{random}\) Random effects formula Both
\(\texttt{randcov_initial}\) Initial or known random effect variances Both
\(\texttt{partition_factor}\) A partition factor formula Both
\(\texttt{...}\) Additional arguments Both
\(\texttt{anisotropy}\) Whether to model anisotropy \(\texttt{spglm()}\)
\(\texttt{xcoord}\) An x-coordinate name \(\texttt{spglm()}\)
\(\texttt{ycoord}\) A y-coordinate name \(\texttt{spglm()}\)
\(\texttt{local}\) Options for large data sets \(\texttt{spglm()}\)
\(\texttt{W}\) A neighborhood weight matrix \(\texttt{spgautor()}\)
\(\texttt{row_st}\) Whether to row-standardize \(\texttt{W}\) \(\texttt{spgautor()}\)
\(\texttt{M}\) A symmetry matrix \(\texttt{spgautor()}\)
\(\texttt{range_positive}\) Whether the range parameter is restricted to be positive
\(\texttt{cutoff}\) Distance cutoff if using distance-based neighbor definition

Model evaluation and diagnostics

After fitting a model, it is usually beneficial to evaluate certain aspects of model fit or inspect and visualize model diagnostics like residuals. There are many functions in spmodel that are used for these purposes, and their main argument is the fitted model object returned by or . The names of these functions are meant to be illustrative, describing the purpose of the function itself. The functions for evaluating model fit (see below) return likelihood-based statistics like AIC, AICc, BIC, and deviance (McCullagh and Nelder 1989), return coefficient estimates and confidence intervals, perform cross validation, and more.

Model evaluation functions available for models fit using spglm() or spgautor() and their purpose.
Function Purpose
\(\texttt{AIC()}\); \(\texttt{AICc()}\) Return AIC and AICc
\(\texttt{anova()}\) Perform an analysis of variance
\(\texttt{AUROC()}\) Compute the area under the receiver operating characteristic curve
\(\texttt{BIC()}\) Return BIC
\(\texttt{coef()}\); \(\texttt{coefficients()}\) Return parameter estimates
\(\texttt{confint()}\) Return confidence intervals
\(\texttt{covmatrix()}\) Return the fitted covariance matrix
\(\texttt{deviance()}\) Return the deviance
\(\texttt{glance()}\); \(\texttt{glances()}\) Glance at model fits
\(\texttt{logLik()}\) Return the log-likelihood
\(\texttt{loocv()}\) Perform leave-one-out cross validation
\(\texttt{pseudoR2()}\) Return the pseudo R-squared
\(\texttt{summary()}\) Summarize the fitted model
\(\texttt{tidy()}\) Tidy the fitted model
\(\texttt{varcomp()}\) Compare sources of variability
\(\texttt{vcov()}\) Return variance-covariance matrices

The functions for inspecting model diagnostics (see below) return Cook’s distances (Cook 1979), fitted values, leverage (hat) values, residuals (of several types), and visualizations.

Mode diagnostic functions available for models fit using spglm() or spgautor() and their purpose.
Function Purpose
\(\texttt{augment()}\) Augment model data with diagnostics
\(\texttt{cooks.distance()}\) Return Cook’s distances
\(\texttt{fitted()}\); \(\texttt{fitted.values()}\) Return fitted values
\(\texttt{hatvalues()}\); Return hat (leverage) values
\(\texttt{plot()}\) Plot model diagnostics
\(\texttt{residuals()}\); \(\texttt{resid()}\); \(\texttt{rstandard()}\) Return residuals

The functions , , and , popularized by the broom R package (Robinson, Hayes, and Couch 2021), are particularly useful. The function tidies parameter estimate output, returning a special called a (Müller and Wickham 2021) that is much easier to manipulate and work with than the parameter estimate output printed by to the R console. The function glances at the model fit, returning a with the sample size, number of estimated parameters, and several model-fit statistics. is an extension of that glances at multiple models simultaneously. The function augments the model fit, returning a with the data used to fit the model as well as model diagnostics. The function can also be used to augment prediction data, as we discuss next.

Prediction

Often a priority of a spatial data analysis is using the fitted model to make predictions at unobserved locations and quantify uncertainties in those predictions. The function in spmodel is used to predict the mean process at unobserved locations. It is similar in structure to for objects fit using in base R and takes two main arguments: , the fitted model object returned by or ; and , a or object whose rows index locations requiring prediction and whose columns contain the explanatory variables and spatial information at these locations. Other arguments to are either required only in specific situations or have default values specified. All arguments to the function are summarized in the following table. The function can also be used to augment with predictions, taking the same arguments as .

Arguments to predict() and their purpose.
Argument Purpose
\(\texttt{object}\) Fitted model object
\(\texttt{newdata}\) New data frame or object
\(\texttt{type}\) Whether predictions are on the link or response scale
\(\texttt{se.fit}\) Whether standard errors should be returned
\(\texttt{interval}\) Interval type (prediction or confidence)
\(\texttt{level}\) Prediction or confidence level
\(\texttt{local}\) Options for large data sets
\(\texttt{var_correct}\) Whether prediction variances should be adjusted
\(\texttt{...}\) Additional arguments

Simulation

spmodel has functions that are used to simulate spatial data from Gaussian, binomial, beta, Poisson, negative binomial, gamma, and inverse Gaussian response distributions. These simulation functions are similar in structure to commonly used base R functions for simulating data. They are also similar in name, simply adding an prefix to the conventional name in base R. For example, is used to simulate Poisson data in base R, while is used to simulate spatial Poisson data in spmodel. These functions are summarized in the following table.

Functions to simulate spatial data and their purpose.
Function Purpose
\(\texttt{sprnorm()}\) Simulate spatial Gaussian data
\(\texttt{sprbinom()}\) Simulate spatial binomial data
\(\texttt{sprbeta()}\) Simulate spatial beta data
\(\texttt{sprpois()}\) Simulate spatial Poisson data
\(\texttt{sprnbinom()}\) Simulate spatial negative binomial data
\(\texttt{sprgamma()}\) Simulate spatial gamma data
\(\texttt{sprinvgauss()}\) Simulate spatial inverse Gaussian data

An application to binary (presence/absence) data

In the next two sections we to show how to use spmodel to analyze data collected on moose in the Togiak region of Alaska, USA. The data in spmodel contain information on moose counts from a survey of sites in the Togiak region. The data are stored as an sf object, a special built to store spatial data. has several variables: ; a stratification variable with two levels ( for low and for medium) based on surrounding landscape metrics; , the site elevation; , whether or not at least one moose was observed at the site; , the number of moose observed at a the site; and , the spatial coordinates in an Alaska Albers projection (EPSG: 3338). We load spmodel and the data by running

We are interested in quantifying the effects of and on moose and begin with some visualizations. First, we visualize the presence for observed sites:

ggplot(moose, aes(color = presence)) + 
  geom_sf(size = 2) + 
  scale_color_viridis_d(option = "H") +
  theme_gray(base_size = 14)
Moose presence.

Moose presence.

There is spatial patterning in the distribution of moose presence, as moose tend to be present in the eastern and southwestern parts of the domain. This spatial pattern seems to have a directional orientation, seemingly strongest in the northwest-to-southeast direction.

Next, we visualize the elevation for each strata:

ggplot(moose, aes(color = elev)) +
  geom_sf(size = 1) + 
  facet_grid(~ strat) +
  scale_color_viridis_c()  +
  scale_y_continuous(breaks = seq(62.6, 63.6, length.out = 5)) +
  scale_x_continuous(breaks = seq(-148.5, -146, length.out = 3)) +
  theme_gray(base_size = 12)
Elevation versus strata in the moose data.

Elevation versus strata in the moose data.

There are no obvious spatial patterns in the assignment of sites to strata, while elevation is highest on the boundaries of the domain and lowest in the middle of the domain. There is spatial patterning in the distribution of moose presence, as moose tend to be present in the eastern and southwestern parts of the domain. This spatial pattern seems to have a directional orientation, seemingly strongest in the northwest-to-southeast direction.

To quantify the effects of , , and spatial dependence on moose presence, we fit three binomial (i.e., ) logistic regression models using :

# model one
bin_mod <- spglm(
  formula = presence ~ elev + strat + elev:strat,
  family = "binomial",
  data = moose,
  spcov_type = "none"
)
# model two
bin_spmod <- spglm(
  formula = presence ~ elev + strat + elev:strat,
  family = "binomial",
  data = moose,
  spcov_type = "spherical"
)
# model three
bin_spmod_anis <- spglm(
  formula = presence ~ elev + strat + elev:strat,
  family = "binomial",
  data = moose,
  spcov_type = "spherical",
  anisotropy = TRUE
)

All three models have the same fixed effect structure, using elevation, strata, and their interaction to explain moose presence. The three models vary in their covariance structure. The first model, , has no spatial covariance (\(\sigma^2_{de} = 0\)). The second model, has a spherical spatial covariance. The third model, , has a spherical spatial covariance that incorporates geometric anisotropy. Geometric anisotropy allows the spatial covariance to vary with direction by evaluating the spatial covariance with \(\mathbf{H}^*\) (instead of \(\mathbf{H}\)), a matrix of distances between transformed coordinates that are rotated and scaled appropriately (Schabenberger and Gotway 2017). We evaluate the three models using AIC by running

AIC(bin_mod, bin_spmod, bin_spmod_anis)
#>                df      AIC
#> bin_mod         0 717.1627
#> bin_spmod       3 686.9031
#> bin_spmod_anis  5 677.9403

The spatial models (, ) have a much lower AIC than the non-spatial model (), which suggests that the models benefit from incorporating spatial dependence. has a lower AIC than , which suggests that the model benefits from incorporating directionality in the spatial dependence. Next we inspect and later use it to predict moose presence probability at unobserved sites.

We summarize using :

summary(bin_spmod_anis)
#> 
#> Call:
#> spglm(formula = presence ~ elev + strat + elev:strat, family = "binomial", 
#>     data = moose, spcov_type = "spherical", anisotropy = TRUE)
#> 
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.7970 -0.7348  0.3291  0.6855  1.7384 
#> 
#> Coefficients (fixed):
#>              Estimate Std. Error z value Pr(>|z|)   
#> (Intercept) -3.112818   1.522362  -2.045  0.04088 * 
#> elev         0.010990   0.004655   2.361  0.01824 * 
#> stratM       3.820320   1.176127   3.248  0.00116 **
#> elev:stratM -0.014020   0.006933  -2.022  0.04316 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Pseudo R-squared: 0.1119
#> 
#> Coefficients (spherical spatial covariance):
#>        de        ie     range    rotate     scale 
#> 6.580e+00 2.220e-03 1.791e+05 2.718e+00 2.154e-01 
#> attr(,"class")
#> [1] "spherical"
#> 
#> Coefficients (Dispersion for binomial family):
#> dispersion 
#>          1

The output from an model is very similar to the output from a base-R model, returning the original function call, deviance residuals, a fixed effects coefficients table, a pseudo R-squared (which quantifies the amount of variability explained by the fixed effects), and covariance parameter coefficients. The fixed effects coefficients table provides some evidence that elevation and strata are associated with with moose presence and that the effect of elevation varies across the two strata (all \(p~\)values \(<\) 0.05).

The area under the receiver receiver operating characteristic (AUROC) curve quantifies the effectiveness of a classifier (Robin et al. 2011). It ranges from zero to one, and higher values indicate better model performance. The AUROC of is

AUROC(bin_spmod_anis)
#> [1] 0.9403199

The function returns graphics of several model diagnostics. Running

plot(bin_spmod_anis, which = c(7, 8))
Spatial dependence is a function of distance (left) and as an anisotropic level curve of equal correlation (right).Spatial dependence is a function of distance (left) and as an anisotropic level curve of equal correlation (right).

Spatial dependence is a function of distance (left) and as an anisotropic level curve of equal correlation (right).

shows that the spatial dependence is evident and appears strongest in the northwest-to-southeast direction. These findings are supported by the clear spatial patterns in the data.

Recall that the , , and functions are particularly useful tools for manipulating and understanding fitted model objects. The function tidies model output, returning a of parameter estimates (and confidence intervals):

tidy(bin_spmod_anis, conf.int = TRUE)
#> # A tibble: 4 × 7
#>   term        estimate std.error statistic p.value conf.low conf.high
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
#> 1 (Intercept)  -3.11     1.52        -2.04 0.0409  -6.10    -0.129   
#> 2 elev          0.0110   0.00466      2.36 0.0182   0.00187  0.0201  
#> 3 elev:stratM  -0.0140   0.00693     -2.02 0.0432  -0.0276  -0.000431
#> 4 stratM        3.82     1.18         3.25 0.00116  1.52     6.13

By default, the fixed effects estimates are returned, but covariance parameter estimates are returned via the argument:

tidy(bin_spmod_anis, effects = "spcov")
#> # A tibble: 5 × 3
#>   term       estimate is_known
#>   <chr>         <dbl> <lgl>   
#> 1 de          6.58    FALSE   
#> 2 ie          0.00222 FALSE   
#> 3 range  179082.      FALSE   
#> 4 rotate      2.72    FALSE   
#> 5 scale       0.215   FALSE

The column indicates whether covariance parameters were assumed known, which is possible to specify using the argument to . Here, all parameters were assumed unknown and then estimated.

The function glances at model fit, returning a of model fit statistics:

glance(bin_spmod_anis)
#> # A tibble: 1 × 10
#>       n     p  npar value   AIC  AICc   BIC logLik deviance pseudo.r.squared
#>   <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
#> 1   218     4     5  668.  678.  678.  695.  -334.     161.            0.112

can be used to glance at multiple models simultaneously and sorts models by ascending AICc:

glances(bin_mod, bin_spmod, bin_spmod_anis)
#> # A tibble: 3 × 11
#>   model              n     p  npar value   AIC  AICc   BIC logLik deviance
#>   <chr>          <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
#> 1 bin_spmod_anis   218     4     5  668.  678.  678.  695.  -334.     161.
#> 2 bin_spmod        218     4     3  681.  687.  687.  697.  -340.     161.
#> 3 bin_mod          218     4     0  717.  717.  717.  717.  -359.     294.
#> # ℹ 1 more variable: pseudo.r.squared <dbl>

The function augments the data with model diagnostics:

aug_mod <- augment(bin_spmod_anis)
aug_mod
#> Simple feature collection with 218 features and 8 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 269085 ymin: 1416151 xmax: 419057.4 ymax: 1541016
#> Projected CRS: NAD83 / Alaska Albers
#> # A tibble: 218 × 9
#>    presence  elev strat .fitted .resid    .hat   .cooksd .std.resid
#>  * <fct>    <dbl> <chr>   <dbl>  <dbl>   <dbl>     <dbl>      <dbl>
#>  1 0         469. L      -1.91  -0.527 0.0923  0.00777       -0.553
#>  2 0         362. L      -3.58  -0.235 0.00973 0.000137      -0.236
#>  3 0         173. M      -3.49  -0.246 0.00136 0.0000205     -0.246
#>  4 0         280. L      -4.41  -0.155 0.00187 0.0000113     -0.155
#>  5 0         620. L      -0.964 -0.804 0.360   0.142         -1.01 
#>  6 0         164. M      -2.64  -0.372 0.00286 0.0000996     -0.373
#>  7 0         164. M      -2.24  -0.449 0.00397 0.000202      -0.450
#>  8 0         186. L      -2.51  -0.396 0.00486 0.000192      -0.397
#>  9 0         362. L      -2.73  -0.355 0.0211  0.000692      -0.359
#> 10 0         430. L      -1.62  -0.600 0.0871  0.00941       -0.628
#> # ℹ 208 more rows
#> # ℹ 1 more variable: geometry <POINT [m]>

where are fitted values (the estimated \(f^{-1}(\mathbf{w})\), or moose presence probability), are response residuals, values indicate leverage, values indicate Cook’s distance (i.e., influence), and are standardized residuals.

The data contains survey sites that were not sampled. We can use to make predictions of the underlying probabilities of moose presence at these sites using or . and return the same predictions but augments the prediction data with the predictions:

data("moose_preds")

spmod_anis_preds <- predict(
  bin_spmod_anis,
  newdata = moose_preds,
  type = "response",
  interval = "prediction"
)
head(spmod_anis_preds)
#>          fit         lwr       upr
#> 1 0.53625613 0.150268897 0.8831963
#> 2 0.45236331 0.106721924 0.8509940
#> 3 0.09535359 0.010375867 0.5144805
#> 4 0.10772261 0.022873507 0.3837176
#> 5 0.83772891 0.466479245 0.9682358
#> 6 0.03276524 0.003867679 0.2281264
aug_pred <- augment(
  bin_spmod_anis,
  newdata = moose_preds,
  type.predict = "response", 
  interval = "prediction"
)
aug_pred
#> Simple feature collection with 100 features and 5 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
#> Projected CRS: NAD83 / Alaska Albers
#> # A tibble: 100 × 6
#>     elev strat .fitted  .lower .upper           geometry
#>  * <dbl> <chr>   <dbl>   <dbl>  <dbl>        <POINT [m]>
#>  1  143. L      0.536  0.150    0.883 (401239.6 1436192)
#>  2  324. L      0.452  0.107    0.851 (352640.6 1490695)
#>  3  158. L      0.0954 0.0104   0.514 (360954.9 1491590)
#>  4  221. M      0.108  0.0229   0.384 (291839.8 1466091)
#>  5  209. M      0.838  0.466    0.968 (310991.9 1441630)
#>  6  218. L      0.0328 0.00387  0.228 (304473.8 1512103)
#>  7  127. L      0.0381 0.00511  0.234 (339011.1 1459318)
#>  8  122. L      0.0747 0.00983  0.396 (342827.3 1463452)
#>  9  191  L      0.375  0.0410   0.893 (284453.8 1502837)
#> 10  105. L      0.142  0.0197   0.577 (391343.9 1483791)
#> # ℹ 90 more rows

Estimated moose presence probability for the data (obtained via ) and predicted moose presence probability for the data (obtained via or ) are overlain below and share similar patterns:

aug_mod$type <- "mod"
aug_pred$type <- "pred"
keep_cols <- c(".fitted", "type")
aug_combined <- rbind(aug_mod[, keep_cols], aug_pred[, keep_cols])
ggplot(aug_combined, aes(color = .fitted, shape = type)) +
  geom_sf(size = 2) +
  scale_color_viridis_c(option = "H")  +
  theme_gray(base_size = 14)
Fitted values and predictions for moose presence probability.

Fitted values and predictions for moose presence probability.

An application to count data

The data also contain the number of moose observed at each site:

ggplot(moose, aes(color = count)) +
  geom_sf(size = 2) +
  scale_color_viridis_c(option = "H")  +
  theme_gray(base_size = 14)
Moose counts.

Moose counts.

The moose counts are similarly distributed as moose presence, highest in the eastern and southwestern parts of the domain. We compare two count models, Poisson and negative binomial. The Poisson model assumes the underlying process generating the counts has the same mean and variance while the negative binomial model allows for overdispersion (i.e., the variance is greater than the mean). We are interested in quantifying the effects of and on moose count, albeit using a slightly different approach than we did for moose presence. Here, we will model elevation as a fixed effect and allow this effect to change between strata, but we will model strata as a random effect to highlight additional flexibility of the spmodel package. We fit relevant Poisson and negative binomial models with a Matérn spatial covariance using :

count_mod_pois <- spglm(
  count ~ elev + elev:strat,
  family = "poisson",
  data = moose,
  spcov_type = "matern",
  random = ~ (1 | strat)
)

count_mod_nb <- spglm(
  count ~ elev + elev:strat,
  family = "nbinomial",
  data = moose,
  spcov_type = "matern",
  random = ~ (1 | strat)
)

Random effects are specified in spmodel via the argument using similar syntax as the commonly used lme4 (Bates et al. 2015) and nlme (Pinheiro and Bates 2006) R packages for non-spatial mixed models. In spmodel, is short-hand for . Random effects alter the covariance structure of the model, building additional correlation into the model for sites sharing a level of the random effect (here, sites within the same strata). More formally, when incorporating a random effect, the spatial generalized linear model becomes \[\begin{equation}\label{eq:spglm-rand} f(\boldsymbol{\mu}) \equiv \mathbf{w} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z}\mathbf{v} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \end{equation}\] where \(\mathbf{Z}\) is a design matrix that indexes the random effects, \(\mathbf{v}\), and \(\text{Cov}(\mathbf{v}) = \sigma^2_{v}\mathbf{I}\). Then the covariance matrix, \(\boldsymbol{\Sigma}\), becomes \(\sigma^2_v \mathbf{Z} \mathbf{Z}^\top + \sigma^2_{de}\mathbf{R} + \sigma^2_{ie}\mathbf{I}\).

Previously we compared models using , but another way to compare models is by leave-one-out cross validation (Hastie et al. 2009). In leave-one-out cross validation, each observation is held-out and the model is re-fit and used to predict the held-out observation. Then a loss statistic is computed that compares each prediction to its true value. We calculate mean-squared-prediction error (MSPE) using leave-one-out cross validation for both the Poisson and negative binomial model by running

loocv(count_mod_pois)
#> # A tibble: 1 × 3
#>    bias  MSPE RMSPE
#>   <dbl> <dbl> <dbl>
#> 1  1.37  32.0  5.66
loocv(count_mod_nb)
#> # A tibble: 1 × 3
#>    bias  MSPE RMSPE
#>   <dbl> <dbl> <dbl>
#> 1 0.291  27.1  5.20

The negative binomial model has a lower MSPE, which suggests that incorporating overdispersion improves model fit here. Tidying this fitted model we see that there is some evidence elevation is related to moose counts and that this effect changes across strata (all \(p~\)values \(< 0.1\)):

tidy(count_mod_nb)
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept) -0.301     1.50       -0.200  0.841 
#> 2 elev         0.00590   0.00300     1.96   0.0494
#> 3 elev:stratM -0.00717   0.00408    -1.76   0.0789

The function in spmodel apportions model variability into fixed and random components:

varcomp(count_mod_nb)
#> # A tibble: 4 × 2
#>   varcomp            proportion
#>   <chr>                   <dbl>
#> 1 Covariates (PR-sq)    0.0329 
#> 2 de                    0.498  
#> 3 ie                    0.00174
#> 4 1 | strat             0.467

Most model variability is explained by the spatially dependent random error () and the random effect for strata (). Running

plot(count_mod_nb, which = c(4, 5))
Model diagnostics.Model diagnostics.

Model diagnostics.

shows observations of high influence or leverage and that the standardized residuals tend to be spread out around zero.

Estimated mean moose counts for the data (obtained via ) and predicted mean moose counts for the data (obtained via or ) share similar patterns and are overlain by running

aug_mod <- augment(count_mod_nb)
aug_pred <- augment(count_mod_nb, newdata = moose_preds, type.predict = "response")
aug_mod$type <- "mod"
aug_pred$type <- "pred"
keep_cols <- c(".fitted", "type")
aug_combined <- rbind(aug_mod[, keep_cols], aug_pred[, keep_cols])
ggplot(aug_combined, aes(color = .fitted, shape = type)) +
  geom_sf(size = 2) +
  scale_color_viridis_c(option = "H")  +
  theme_gray(base_size = 18)
Fitted values and predictions for moose counts.

Fitted values and predictions for moose counts.

Discussion

Throughout this vignette, we have shown several features spmodel offers, including a novel application of the Laplace approximation, similarity to base R’s function, over a dozen spatial covariance functions, a variety of tools available to evaluate models, inspect model diagnostics, and make predictions using ubiquitous base R functions (e.g., , , and ) and more. Spatial generalized linear models for point-referenced data (i.e., generalized geostatistical models) are fit using the function. Spatial generalized linear models for areal data (i.e., generalized spatial autoregressive models) are fit using the function. Both functions share common structure and syntax. Spatial data are simulated in spmodel by adding an prefix to commonly used base R simulation functions (e.g., ).

We appreciate feedback from users regarding spmodel. To learn more about how to provide feedback or contribute to spmodel, please visit our GitHub repository at https://github.com/USEPA/spmodel.

References

Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19 (6): 716–23.
Appelhans, Tim, Florian Detsch, Christoph Reudenbach, and Stefan Woellauer. 2022. Mapview: Interactive Viewing of Spatial Data in r. https://CRAN.R-project.org/package=mapview.
Banerjee, Sudipto, Bradley P Carlin, and Alan E Gelfand. 2014. Hierarchical Modeling and Analysis for Spatial Data. CRC press.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bolker, Benjamin M, Mollie E Brooks, Connie J Clark, Shane W Geange, John R Poulsen, M Henry H Stevens, and Jada-Simone S White. 2009. “Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution.” Trends in Ecology & Evolution 24 (3): 127–35.
Breslow, Norman E, and David G Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88 (421): 9–25.
Chiles, Jean-Paul, and Pierre Delfiner. 2012. Geostatistics: Modeling Spatial Uncertainty. Vol. 713. John Wiley & Sons.
Cook, R Dennis. 1979. “Influential Observations in Linear Regression.” Journal of the American Statistical Association 74 (365): 169–74.
Cressie, Noel. 1993. Statistics for Spatial Data. John Wiley & Sons.
Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC press.
Hastie, Trevor, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. Springer.
Hoeting, Jennifer A, Richard A Davis, Andrew A Merton, and Sandra E Thompson. 2006. “Model Selection for Geostatistical Models.” Ecological Applications 16 (1): 87–98.
Lee, Youngjo, and John A Nelder. 1996. “Hierarchical Generalized Linear Models.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (4): 619–56.
McCullagh, Peter, and John A. Nelder. 1989. Generalized Linear Models, Second Edition. Chapman; Hall Ltd.
Müller, Kirill, and Hadley Wickham. 2021. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Myers, Raymond H, Douglas C Montgomery, G Geoffrey Vining, and Timothy J Robinson. 2012. Generalized Linear Models: With Applications in Engineering and the Sciences. John Wiley & Sons.
Nelder, John Ashworth, and Robert WM Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society: Series A (General) 135 (3): 370–84.
Pebesma, Edzer. 2018. Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Pinheiro, José, and Douglas Bates. 2006. Mixed-Effects Models in s and s-PLUS. Springer science & business media.
Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for r and s+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77.
Robinson, David, Alex Hayes, and Simon Couch. 2021. broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
Schabenberger, Oliver, and Carol A Gotway. 2017. Statistical Methods for Spatial Data Analysis. CRC press.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics, 461–64.
Ver Hoef, Jay M, Eryn Blagg, Michael Dumelle, Philip M Dixon, Dale L Zimmerman, and Paul B Conn. 2024. “Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation.” Environmetrics, e2872.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. CRC press.