vignettes/articles/SPGLMs.Rmd
SPGLMs.Rmd
spmodel
is an package used to fit, summarize, and
predict for a variety of spatial statistical models. The vignette
provides an introduction spatial generalized linear models for
non-Gaussian response distributions in spmodel
. Before
proceeding, we load spmodel
by running
If using spmodel
in a formal publication or report,
please cite it. Citing spmodel
lets us devote more
resources to the package in the future. We view the spmodel
citation by running
citation(package = "spmodel")
#> To cite spmodel in publications use:
#>
#> Dumelle M, Higham M, Ver Hoef JM (2023). spmodel: Spatial statistical
#> modeling and prediction in R. PLOS ONE 18(3): e0282524.
#> https://doi.org/10.1371/journal.pone.0282524
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {{spmodel}: Spatial statistical modeling and prediction in {R}},
#> author = {Michael Dumelle and Matt Higham and Jay M. {Ver Hoef}},
#> journal = {PLOS ONE},
#> year = {2023},
#> volume = {18},
#> number = {3},
#> pages = {1--32},
#> doi = {10.1371/journal.pone.0282524},
#> url = {https://doi.org/10.1371/journal.pone.0282524},
#> }
We will create visualizations using ggplot2
(Wickham 2016), which we load by running
ggplot2
is only installed alongside spmodel
when dependencies = TRUE
in
install.packages()
, so check that the package is installed
and loaded before reproducing any of these vignette’s visualizations. We
will also show code that can be used to create interactive
visualizations of spatial data with mapview
(Appelhans et al. 2022). mapview
also has many backgrounds available that contextualize spatial data with
topographical information. Before running the mapview
code
interactively, make sure mapview
is installed and
loaded.
spmodel
contains various methods for generic functions
defined outside of spmodel
. To find relevant documentation
for these methods, run help("generic.spmodel", "spmodel")
(e.g., help("summary.spmodel", "spmodel")
,
help("predict.spmodel", "spmodel")
, etc.). Note that
?generic.spmodel
is shorthand for
help("generic.spmodel", "spmodel")
. We provide more details
and examples regarding these methods and generics throughout this
vignette. For a full list of spmodel
functions available,
see spmodel
’s documentation manual.
We first focus on spatial linear models for a Gaussian response vector. The spatial linear model is defined as \[\begin{equation}\label{eq:splm} \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \end{equation}\] where for a sample size \(n\), \(\mathbf{y}\) is an \(n \times 1\) column vector of response variables, \(\mathbf{X}\) is an \(n \times p\) design (model) matrix of explanatory variables, \(\boldsymbol{\beta}\) is a \(p \times 1\) column vector of fixed effects controlling the impact of \(\mathbf{X}\) on \(\mathbf{y}\), \(\boldsymbol{\tau}\) is an \(n \times 1\) column vector of spatially dependent random errors, and \(\boldsymbol{\epsilon}\) is an \(n \times 1\) column vector of spatially independent random errors. We make a few assumptions about \(\boldsymbol{\tau}\) and \(\boldsymbol{\epsilon}\) in the spatial linear model: first, that \(\text{E}(\boldsymbol{\tau}) = \text{E}(\boldsymbol{\epsilon}) = \boldsymbol{0}\), where \(\text{E}(\cdot)\) denotes expectation; second, that \(\text{Cov}(\boldsymbol{\tau}) = \sigma^2_\tau \mathbf{R}\), where \(\mathbf{R}\) is an \(n \times n\) matrix that determines the spatial dependence structure in \(\mathbf{y}\) and depends on a range parameter, \(\phi\); third, that \(\text{Cov}(\boldsymbol{\epsilon}) = \sigma^2_\epsilon \mathbf{I}\), where \(\mathbf{I}\) is an \(n \times n\) identity matrix; and fourth, that \(\boldsymbol{\tau}\) and \(\boldsymbol{\epsilon}\) are independent of one another. The parameter \(\sigma^2_{\tau}\) is called the spatially dependent random error variance or partial sill. The parameter \(\sigma^2_\epsilon\) is called the spatially independent random error variance or nugget. These two variance parameters are henceforth more intuitively written as \(\sigma^2_{de}\) and \(\sigma^2_{ie}\), respectively. The covariance of \(\mathbf{y}\) is denoted \(\boldsymbol{\Sigma}\) and given by \[\begin{equation}\label{eq:spcov} \boldsymbol{\Sigma} = \sigma^2_{de}\mathbf{R} + \sigma^2_{ie} \mathbf{I}. \end{equation}\] The parameters \(\sigma^2_{de}\), \(\phi\), and \(\sigma^2_{ie}\) are elements of \(\boldsymbol{\theta}\), the covariance parameter vector.
The spatial linear model applies to both point-referenced and areal
(i.e., lattice; polygon) data. Spatial data are point-referenced when
the elements in \(\mathbf{y}\) are
observed at point-locations indexed by x-coordinates and y-coordinates
on a spatially continuous surface with an infinite number of locations.
For example, consider sampling soil at any point-location in a field.
Spatial linear models for point-referenced data are sometimes called
geostatistical models. An example of an \(\mathbf{R}\) matrix for point-referenced
data is the spherical correlation matrix given by \[\begin{equation}\label{eq:spherical}
\mathbf{R} = \left(1 - 1.5\frac{\mathbf{H}}{\phi} +
0.5\frac{\mathbf{H}^3}{\phi^3}\right) \odot \mathcal{I}(\mathbf{H} \leq
\phi),
\end{equation}\]
where \(\mathbf{H}\) is a matrix of
Euclidean distances among observations, \(\mathbf{H}^3\) is a matrix of cubed
Euclidean distances, \(\odot\) is the
Hadmard (element-wise; direct) product, and \(\mathcal{I}(\mathbf{H} \leq \phi)\) is an
indicator function equal to one when the \(ij\)th element of \(\mathbf{H}\) is less than \(\phi\) (the range) and zero otherwise. A
second is example is the Matérn correlation matrix given by \[\begin{equation}\label{eq:matern}
\mathbf{R} = \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(2\nu \mathbf{H} /
\phi \right)^{\nu / 2} \odot B_k(2\nu \mathbf{H}, \nu),
\end{equation}\] where \(\nu\)
is a smoothness parameter, \(\odot\) is
the Hadmart (element-wise; direct) product, and \(B_k(\cdot, \nu)\) is a Bessel-K function
with order \(\nu\). Chiles and Delfiner (2012) provide examples of
several spatial dependence structures.
Spatial data are areal when they are observed as part of a finite network of polygons whose connections are indexed by a neighborhood structure. For example, the polygons may represent counties in a state who are neighbors if they share at least one boundary. An example of an \(\mathbf{R}\) matrix for areal data is the simultaneous-autoregressive spatial correlation matrix given by \[\begin{equation*} \mathbf{R} = [(\mathbf{I} - \phi \mathbf{W})(\mathbf{I} - \phi \mathbf{W})^\top]^{-1}, \end{equation*}\] where \(\mathbf{W}\) is a weight matrix that describes the neighborhood structure among observations. Spatial linear models for areal data are sometimes called spatial autoregressive models. For thorough reviews of the spatial linear model, see Cressie (1993), Banerjee, Carlin, and Gelfand (2014), and Schabenberger and Gotway (2017).
It is common to implicitly assume \(\mathbf{y}\) in the spatial linear model is unconstrained and Gaussian. This is a restrictive assumption and often inappropriate for binary data, proportion data, count data, and skewed data, which are common in many practical applications. These types of data are most naturally described using generalized linear models (Nelder and Wedderburn 1972; McCullagh and Nelder 1989). Generalized linear models are an extension of linear models that “link” a function of a distribution’s mean to the linear function \(\mathbf{X} \boldsymbol{\beta}\) (the explanatory variables and fixed effects). We can extend generalized linear models by adding linear, latent (i.e., unobserved) random effects to \(\mathbf{X} \boldsymbol{\beta}\). These types of models are called generalized linear mixed models (Breslow and Clayton 1993) and are often formulated using a hierarchical construction (Lee and Nelder 1996). For thorough reviews of generalized linear models and generalized linear mixed models, see Bolker et al. (2009), Myers et al. (2012), Faraway (2016), and Wood (2017).
A recent spmodel
release (version 0.4.0) provided
support for spatial generalized linear models (i.e., a generalized
linear mixed model with spatial random effects). We define the spatial
generalized linear model as \[\begin{equation}\label{eq:spglm}
f(\boldsymbol{\mu}) \equiv \mathbf{w} = \mathbf{X} \boldsymbol{\beta} +
\boldsymbol{\tau} + \boldsymbol{\epsilon},
\end{equation}\] where \(f(\cdot)\) is the link function, \(\boldsymbol{\mu}\) is the mean of \(\mathbf{y}\), and the remaining terms \(\mathbf{X}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\tau}\), \(\boldsymbol{\epsilon}\) represent the same
quantities as for spatial linear models. That is, \(\text{E}(\boldsymbol{\tau}) =
\text{E}(\boldsymbol{\epsilon}) = \boldsymbol{0}\), \(\text{Cov}(\boldsymbol{\tau}) =
\sigma^2_{de}\mathbf{R}\), \(\text{Cov}(\boldsymbol{\epsilon}) = \sigma^2_{ie}
\mathbf{I}\), \(\boldsymbol{\tau}\) is independent of \(\boldsymbol{\epsilon}\), and \(\boldsymbol{\Sigma} = \sigma^2_{de} \mathbf{R} +
\sigma^2_{ie} \mathbf{I}\). The link function, \(f(\cdot)\), “links” a function of \(\boldsymbol{\mu}\) to the linear term \(\mathbf{w}\). Note that the linking of
\(\boldsymbol{\mu}\) to \(\mathbf{w}\) applies element-wise to each
vector. Each link function \(f(\cdot)\)
has a corresponding inverse link function, \(f^{-1}(\cdot)\), which “links” a function
of \(\mathbf{w}\) to \(\boldsymbol{\mu}\). Notice that \(\mathbf{w}\) is unconstrained but \(\boldsymbol{\mu}\) is usually constrained
in some way (e.g., positive) that depends on the distribution assumed
for \(\mathbf{y}\).
Spatial linear and generalized linear models rely on the same components: explanatory variables, fixed effects, spatially dependent random errors, and spatial independent random errors. These components are related to the response linearly (spatial linear models) or non-linearly (spatial generalized linear models). The response itself is generally assumed to follow a Gaussian distribution (spatial linear models) or one of many non-Gaussian distributions (spatial generalized linear models). Some non-Gaussian distributions include the binomial distribution for binary responses, the beta distribution for proportion responses, the Poisson or negative binomial distributions for count responses, and the gamma or inverse Gaussian distributions for skewed, positive continuous responses.
spmodel
implements a marginal approach to inference that
uses a novel application of the Laplace approximation. This marginal
approach is powerful and flexible, accommodating a wide range of
covariance structures. It is a frequentist approach, which means that it
does not rely on any Bayesian sampling routines, as is common when
handling complex hierarchical structures. It also formally maximizes a
likelihood, which is advantageous because it means that likelihood-based
statistics such as AIC (Akaike 1974; Hoeting et
al. 2006) and BIC (Schwarz 1978)
are defined, in contrast to approaches that only specify the first two
moments of a distribution (e.g., quasi-likelihood fit via iteratively
re-weighted least squares). Ver Hoef et al.
(2024) provide the methodology’s full details.
spmodel
accommodates the binomial, beta, Poisson,
negative binomial, gamma, and inverse Gaussian distributions. The
binomial and beta response distributions (specified by ) use the logit
link, while the Poisson, negative binomial, gamma, and inverse Gaussian
distributions use the log link – see the following table.
Family | Link Function | Link Name | Data Type |
---|---|---|---|
Binomial | \(f(\mathbf{\mu}) = \log(\mathbf{\mu} / (1 - \mathbf{\mu}))\) | Logit | Binary; Binary Count |
Poisson | \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) | Log | Count |
Negative Binomial | \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) | Log | Count |
Beta | \(f(\mathbf{\mu}) = \log(\mathbf{\mu} / (1 - \mathbf{\mu}))\) | Logit | Proportion |
Gamma | \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) | Log | Skewed |
Inverse Gaussian | \(f(\mathbf{\mu}) = \log(\mathbf{\mu})\) | Log | Skewed |
Spatial generalized linear models in spmodel
are fit
using (for point-referenced data) or (for areal data). These functions
are similar in structure both to one another and to the function in base
R. They generally require the following four arguments:
, a formula that describes the relationship between the response and
explanatory variables; , the response distribution, which can be , , , ,
, or ; , a or object (Pebesma 2018) whose
rows index observations and whose columns contain the response variable,
explanatory variables, and spatial information; and , one of 19 spatial
covariance types (e.g., spherical spatial covariance, Matérn spatial
covariance, simultaneous-autoregressive spatial covariance etc.). The
binomial and beta response distributions (specified by ) use the logit
link, while the Poisson, negative binomial, gamma, and inverse Gaussian
distributions use the log link. Other arguments to and are either
required only in specific situations or have default values specified.
All arguments to the and functions are summarized in the following
table. Models for several hundred observations can take anywhere from a
few seconds to a few minutes to fit, depending on the shape of the
likelihood surface and how quickly the covariance parameters converge.
Models for several thousand observations can utilize
spmodel
’s large data set options.
Argument | Purposes | Applies To |
---|---|---|
\(\texttt{formula}\) | Fixed effects formula | Both |
\(\texttt{family}\) | Response distribution | Both |
\(\texttt{data}\) | Data Frame or object | Both |
\(\texttt{spcov_type}\) | Spatial covariance type | Both |
\(\texttt{spcov_initial}\) | Initial or known spatial covariance parameters | Both |
\(\texttt{dispersion_initial}\) | Initial or known dispersion parameter | Both |
\(\texttt{estmethod}\) | Estimation method | Both |
\(\texttt{random}\) | Random effects formula | Both |
\(\texttt{randcov_initial}\) | Initial or known random effect variances | Both |
\(\texttt{partition_factor}\) | A partition factor formula | Both |
\(\texttt{...}\) | Additional arguments | Both |
\(\texttt{anisotropy}\) | Whether to model anisotropy | \(\texttt{spglm()}\) |
\(\texttt{xcoord}\) | An x-coordinate name | \(\texttt{spglm()}\) |
\(\texttt{ycoord}\) | A y-coordinate name | \(\texttt{spglm()}\) |
\(\texttt{local}\) | Options for large data sets | \(\texttt{spglm()}\) |
\(\texttt{W}\) | A neighborhood weight matrix | \(\texttt{spgautor()}\) |
\(\texttt{row_st}\) | Whether to row-standardize \(\texttt{W}\) | \(\texttt{spgautor()}\) |
\(\texttt{M}\) | A symmetry matrix | \(\texttt{spgautor()}\) |
\(\texttt{range_positive}\) | Whether the range parameter is restricted to be positive | |
\(\texttt{cutoff}\) | Distance cutoff if using distance-based neighbor definition |
After fitting a model, it is usually beneficial to evaluate certain
aspects of model fit or inspect and visualize model diagnostics like
residuals. There are many functions in spmodel
that are
used for these purposes, and their main argument is the fitted model
object returned by or . The names of these functions are meant to be
illustrative, describing the purpose of the function itself. The
functions for evaluating model fit (see below) return likelihood-based
statistics like AIC, AICc, BIC, and deviance (McCullagh and Nelder 1989), return coefficient
estimates and confidence intervals, perform cross validation, and
more.
Function | Purpose |
---|---|
\(\texttt{AIC()}\); \(\texttt{AICc()}\) | Return AIC and AICc |
\(\texttt{anova()}\) | Perform an analysis of variance |
\(\texttt{AUROC()}\) | Compute the area under the receiver operating characteristic curve |
\(\texttt{BIC()}\) | Return BIC |
\(\texttt{coef()}\); \(\texttt{coefficients()}\) | Return parameter estimates |
\(\texttt{confint()}\) | Return confidence intervals |
\(\texttt{covmatrix()}\) | Return the fitted covariance matrix |
\(\texttt{deviance()}\) | Return the deviance |
\(\texttt{glance()}\); \(\texttt{glances()}\) | Glance at model fits |
\(\texttt{logLik()}\) | Return the log-likelihood |
\(\texttt{loocv()}\) | Perform leave-one-out cross validation |
\(\texttt{pseudoR2()}\) | Return the pseudo R-squared |
\(\texttt{summary()}\) | Summarize the fitted model |
\(\texttt{tidy()}\) | Tidy the fitted model |
\(\texttt{varcomp()}\) | Compare sources of variability |
\(\texttt{vcov()}\) | Return variance-covariance matrices |
The functions for inspecting model diagnostics (see below) return Cook’s distances (Cook 1979), fitted values, leverage (hat) values, residuals (of several types), and visualizations.
Function | Purpose |
---|---|
\(\texttt{augment()}\) | Augment model data with diagnostics |
\(\texttt{cooks.distance()}\) | Return Cook’s distances |
\(\texttt{fitted()}\); \(\texttt{fitted.values()}\) | Return fitted values |
\(\texttt{hatvalues()}\); | Return hat (leverage) values |
\(\texttt{plot()}\) | Plot model diagnostics |
\(\texttt{residuals()}\); \(\texttt{resid()}\); \(\texttt{rstandard()}\) | Return residuals |
The functions , , and , popularized by the broom
R package (Robinson, Hayes, and
Couch 2021), are particularly useful. The function tidies
parameter estimate output, returning a special called a (Müller and Wickham 2021) that is much easier to
manipulate and work with than the parameter estimate output printed by
to the R console. The function glances at the model
fit, returning a with the sample size, number of estimated parameters,
and several model-fit statistics. is an extension of that glances at
multiple models simultaneously. The function augments the model fit,
returning a with the data used to fit the model as well as model
diagnostics. The function can also be used to augment prediction data,
as we discuss next.
Often a priority of a spatial data analysis is using the fitted model
to make predictions at unobserved locations and quantify uncertainties
in those predictions. The function in spmodel
is used to
predict the mean process at unobserved locations. It is similar in
structure to for objects fit using in base R and takes
two main arguments: , the fitted model object returned by or ; and , a
or object whose rows index locations requiring prediction and whose
columns contain the explanatory variables and spatial information at
these locations. Other arguments to are either required only in specific
situations or have default values specified. All arguments to the
function are summarized in the following table. The function can also be
used to augment with predictions, taking the same arguments as .
Argument | Purpose |
---|---|
\(\texttt{object}\) | Fitted model object |
\(\texttt{newdata}\) | New data frame or object |
\(\texttt{type}\) | Whether predictions are on the link or response scale |
\(\texttt{se.fit}\) | Whether standard errors should be returned |
\(\texttt{interval}\) | Interval type (prediction or confidence) |
\(\texttt{level}\) | Prediction or confidence level |
\(\texttt{local}\) | Options for large data sets |
\(\texttt{var_correct}\) | Whether prediction variances should be adjusted |
\(\texttt{...}\) | Additional arguments |
spmodel
has functions that are used to simulate spatial
data from Gaussian, binomial, beta, Poisson, negative binomial, gamma,
and inverse Gaussian response distributions. These simulation functions
are similar in structure to commonly used base R
functions for simulating data. They are also similar in name, simply
adding an prefix to the conventional name in base R.
For example, is used to simulate Poisson data in base
R, while is used to simulate spatial Poisson data in
spmodel
. These functions are summarized in the following
table.
Function | Purpose |
---|---|
\(\texttt{sprnorm()}\) | Simulate spatial Gaussian data |
\(\texttt{sprbinom()}\) | Simulate spatial binomial data |
\(\texttt{sprbeta()}\) | Simulate spatial beta data |
\(\texttt{sprpois()}\) | Simulate spatial Poisson data |
\(\texttt{sprnbinom()}\) | Simulate spatial negative binomial data |
\(\texttt{sprgamma()}\) | Simulate spatial gamma data |
\(\texttt{sprinvgauss()}\) | Simulate spatial inverse Gaussian data |
In the next two sections we to show how to use spmodel
to analyze data collected on moose in the Togiak region of Alaska, USA.
The data in spmodel
contain information on moose counts
from a survey of sites in the Togiak region. The data are stored as an
sf
object, a special built to store spatial data. has
several variables: ; a stratification variable with two levels ( for low
and for medium) based on surrounding landscape metrics; , the site
elevation; , whether or not at least one moose was observed at the site;
, the number of moose observed at a the site; and , the spatial
coordinates in an Alaska Albers projection (EPSG: 3338). We load
spmodel
and the data by running
We are interested in quantifying the effects of and on moose and begin with some visualizations. First, we visualize the presence for observed sites:
ggplot(moose, aes(color = presence)) +
geom_sf(size = 2) +
scale_color_viridis_d(option = "H") +
theme_gray(base_size = 14)
There is spatial patterning in the distribution of moose presence, as moose tend to be present in the eastern and southwestern parts of the domain. This spatial pattern seems to have a directional orientation, seemingly strongest in the northwest-to-southeast direction.
Next, we visualize the elevation for each strata:
ggplot(moose, aes(color = elev)) +
geom_sf(size = 1) +
facet_grid(~ strat) +
scale_color_viridis_c() +
scale_y_continuous(breaks = seq(62.6, 63.6, length.out = 5)) +
scale_x_continuous(breaks = seq(-148.5, -146, length.out = 3)) +
theme_gray(base_size = 12)
There are no obvious spatial patterns in the assignment of sites to strata, while elevation is highest on the boundaries of the domain and lowest in the middle of the domain. There is spatial patterning in the distribution of moose presence, as moose tend to be present in the eastern and southwestern parts of the domain. This spatial pattern seems to have a directional orientation, seemingly strongest in the northwest-to-southeast direction.
To quantify the effects of , , and spatial dependence on moose presence, we fit three binomial (i.e., ) logistic regression models using :
# model one
bin_mod <- spglm(
formula = presence ~ elev + strat + elev:strat,
family = "binomial",
data = moose,
spcov_type = "none"
)
# model two
bin_spmod <- spglm(
formula = presence ~ elev + strat + elev:strat,
family = "binomial",
data = moose,
spcov_type = "spherical"
)
# model three
bin_spmod_anis <- spglm(
formula = presence ~ elev + strat + elev:strat,
family = "binomial",
data = moose,
spcov_type = "spherical",
anisotropy = TRUE
)
All three models have the same fixed effect structure, using elevation, strata, and their interaction to explain moose presence. The three models vary in their covariance structure. The first model, , has no spatial covariance (\(\sigma^2_{de} = 0\)). The second model, has a spherical spatial covariance. The third model, , has a spherical spatial covariance that incorporates geometric anisotropy. Geometric anisotropy allows the spatial covariance to vary with direction by evaluating the spatial covariance with \(\mathbf{H}^*\) (instead of \(\mathbf{H}\)), a matrix of distances between transformed coordinates that are rotated and scaled appropriately (Schabenberger and Gotway 2017). We evaluate the three models using AIC by running
AIC(bin_mod, bin_spmod, bin_spmod_anis)
#> df AIC
#> bin_mod 1 719.1628
#> bin_spmod 3 686.9031
#> bin_spmod_anis 5 677.9403
The spatial models (, ) have a much lower AIC than the non-spatial model (), which suggests that the models benefit from incorporating spatial dependence. has a lower AIC than , which suggests that the model benefits from incorporating directionality in the spatial dependence. Next we inspect and later use it to predict moose presence probability at unobserved sites.
We summarize using :
summary(bin_spmod_anis)
#>
#> Call:
#> spglm(formula = presence ~ elev + strat + elev:strat, family = "binomial",
#> data = moose, spcov_type = "spherical", anisotropy = TRUE)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7970 -0.7348 0.3291 0.6855 1.7384
#>
#> Coefficients (fixed):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.112818 1.522362 -2.045 0.04088 *
#> elev 0.010990 0.004655 2.361 0.01824 *
#> stratM 3.820320 1.176127 3.248 0.00116 **
#> elev:stratM -0.014020 0.006933 -2.022 0.04316 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Pseudo R-squared: 0.1119
#>
#> Coefficients (spherical spatial covariance):
#> de ie range rotate scale
#> 6.580e+00 2.220e-03 1.791e+05 2.718e+00 2.154e-01
#> attr(,"class")
#> [1] "spherical"
#>
#> Coefficients (Dispersion for binomial family):
#> dispersion
#> 1
The output from an model is very similar to the output from a base-R model, returning the original function call, deviance residuals, a fixed effects coefficients table, a pseudo R-squared (which quantifies the amount of variability explained by the fixed effects), and covariance parameter coefficients. The fixed effects coefficients table provides some evidence that elevation and strata are associated with with moose presence and that the effect of elevation varies across the two strata (all \(p~\)values \(<\) 0.05).
The area under the receiver receiver operating characteristic (AUROC) curve quantifies the effectiveness of a classifier (Robin et al. 2011). It ranges from zero to one, and higher values indicate better model performance. The AUROC of is
AUROC(bin_spmod_anis)
#> [1] 0.9403199
The function returns graphics of several model diagnostics. Running
shows that the spatial dependence is evident and appears strongest in the northwest-to-southeast direction. These findings are supported by the clear spatial patterns in the data.
Recall that the , , and functions are particularly useful tools for manipulating and understanding fitted model objects. The function tidies model output, returning a of parameter estimates (and confidence intervals):
tidy(bin_spmod_anis, conf.int = TRUE)
#> # A tibble: 4 × 7
#> term estimate std.error statistic p.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -3.11 1.52 -2.04 0.0409 -6.10 -0.129
#> 2 elev 0.0110 0.00466 2.36 0.0182 0.00187 0.0201
#> 3 elev:stratM -0.0140 0.00693 -2.02 0.0432 -0.0276 -0.000431
#> 4 stratM 3.82 1.18 3.25 0.00116 1.52 6.13
By default, the fixed effects estimates are returned, but covariance parameter estimates are returned via the argument:
tidy(bin_spmod_anis, effects = "spcov")
#> # A tibble: 5 × 3
#> term estimate is_known
#> <chr> <dbl> <lgl>
#> 1 de 6.58 FALSE
#> 2 ie 0.00222 FALSE
#> 3 range 179082. FALSE
#> 4 rotate 2.72 FALSE
#> 5 scale 0.215 FALSE
The column indicates whether covariance parameters were assumed known, which is possible to specify using the argument to . Here, all parameters were assumed unknown and then estimated.
The function glances at model fit, returning a of model fit statistics:
glance(bin_spmod_anis)
#> # A tibble: 1 × 10
#> n p npar value AIC AICc BIC logLik deviance pseudo.r.squared
#> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 218 4 5 668. 678. 678. 695. -334. 161. 0.112
can be used to glance at multiple models simultaneously and sorts models by ascending AICc:
glances(bin_mod, bin_spmod, bin_spmod_anis)
#> # A tibble: 3 × 11
#> model n p npar value AIC AICc BIC logLik deviance
#> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bin_spmod_anis 218 4 5 668. 678. 678. 695. -334. 161.
#> 2 bin_spmod 218 4 3 681. 687. 687. 697. -340. 161.
#> 3 bin_mod 218 4 1 717. 719. 719. 723. -359. 294.
#> # ℹ 1 more variable: pseudo.r.squared <dbl>
The function augments the data with model diagnostics:
aug_mod <- augment(bin_spmod_anis)
aug_mod
#> Simple feature collection with 218 features and 8 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 269085 ymin: 1416151 xmax: 419057.4 ymax: 1541016
#> Projected CRS: NAD83 / Alaska Albers
#> # A tibble: 218 × 9
#> presence elev strat .fitted .resid .hat .cooksd .std.resid
#> * <fct> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 469. L -1.91 -0.527 0.0923 0.00777 -0.553
#> 2 0 362. L -3.58 -0.235 0.00973 0.000137 -0.236
#> 3 0 173. M -3.49 -0.246 0.00136 0.0000205 -0.246
#> 4 0 280. L -4.41 -0.155 0.00187 0.0000113 -0.155
#> 5 0 620. L -0.964 -0.804 0.360 0.142 -1.01
#> 6 0 164. M -2.64 -0.372 0.00286 0.0000996 -0.373
#> 7 0 164. M -2.24 -0.449 0.00397 0.000202 -0.450
#> 8 0 186. L -2.51 -0.396 0.00486 0.000192 -0.397
#> 9 0 362. L -2.73 -0.355 0.0211 0.000692 -0.359
#> 10 0 430. L -1.62 -0.600 0.0871 0.00941 -0.628
#> # ℹ 208 more rows
#> # ℹ 1 more variable: geometry <POINT [m]>
where are fitted values (the estimated \(f^{-1}(\mathbf{w})\), or moose presence probability), are response residuals, values indicate leverage, values indicate Cook’s distance (i.e., influence), and are standardized residuals.
The data contains survey sites that were not sampled. We can use to make predictions of the underlying probabilities of moose presence at these sites using or . and return the same predictions but augments the prediction data with the predictions:
data("moose_preds")
spmod_anis_preds <- predict(
bin_spmod_anis,
newdata = moose_preds,
type = "response",
interval = "prediction"
)
head(spmod_anis_preds)
#> fit lwr upr
#> 1 0.53625613 0.150268897 0.8831963
#> 2 0.45236331 0.106721924 0.8509940
#> 3 0.09535359 0.010375867 0.5144805
#> 4 0.10772261 0.022873507 0.3837176
#> 5 0.83772891 0.466479245 0.9682358
#> 6 0.03276524 0.003867679 0.2281264
aug_pred <- augment(
bin_spmod_anis,
newdata = moose_preds,
type.predict = "response",
interval = "prediction"
)
aug_pred
#> Simple feature collection with 100 features and 5 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
#> Projected CRS: NAD83 / Alaska Albers
#> # A tibble: 100 × 6
#> elev strat .fitted .lower .upper geometry
#> * <dbl> <chr> <dbl> <dbl> <dbl> <POINT [m]>
#> 1 143. L 0.536 0.150 0.883 (401239.6 1436192)
#> 2 324. L 0.452 0.107 0.851 (352640.6 1490695)
#> 3 158. L 0.0954 0.0104 0.514 (360954.9 1491590)
#> 4 221. M 0.108 0.0229 0.384 (291839.8 1466091)
#> 5 209. M 0.838 0.466 0.968 (310991.9 1441630)
#> 6 218. L 0.0328 0.00387 0.228 (304473.8 1512103)
#> 7 127. L 0.0381 0.00511 0.234 (339011.1 1459318)
#> 8 122. L 0.0747 0.00983 0.396 (342827.3 1463452)
#> 9 191 L 0.375 0.0410 0.893 (284453.8 1502837)
#> 10 105. L 0.142 0.0197 0.577 (391343.9 1483791)
#> # ℹ 90 more rows
Estimated moose presence probability for the data (obtained via ) and predicted moose presence probability for the data (obtained via or ) are overlain below and share similar patterns:
aug_mod$type <- "mod"
aug_pred$type <- "pred"
keep_cols <- c(".fitted", "type")
aug_combined <- rbind(aug_mod[, keep_cols], aug_pred[, keep_cols])
ggplot(aug_combined, aes(color = .fitted, shape = type)) +
geom_sf(size = 2) +
scale_color_viridis_c(option = "H") +
theme_gray(base_size = 14)
The data also contain the number of moose observed at each site:
ggplot(moose, aes(color = count)) +
geom_sf(size = 2) +
scale_color_viridis_c(option = "H") +
theme_gray(base_size = 14)
The moose counts are similarly distributed as moose presence, highest
in the eastern and southwestern parts of the domain. We compare two
count models, Poisson and negative binomial. The Poisson model assumes
the underlying process generating the counts has the same mean and
variance while the negative binomial model allows for overdispersion
(i.e., the variance is greater than the mean). We are interested in
quantifying the effects of and on moose count
, albeit using
a slightly different approach than we did for moose
presence
. Here, we will model elevation as a fixed effect
and allow this effect to change between strata, but we will model strata
as a random effect to highlight additional flexibility of the
spmodel
package. We fit relevant Poisson and negative
binomial models with a Matérn spatial covariance using :
count_mod_pois <- spglm(
count ~ elev + elev:strat,
family = "poisson",
data = moose,
spcov_type = "matern",
random = ~ (1 | strat)
)
count_mod_nb <- spglm(
count ~ elev + elev:strat,
family = "nbinomial",
data = moose,
spcov_type = "matern",
random = ~ (1 | strat)
)
Random effects are specified in spmodel
via the argument
using similar syntax as the commonly used lme4
(Bates et al. 2015) and nlme
(Pinheiro and Bates 2006) R
packages for non-spatial mixed models. In spmodel
, is
short-hand for . Random effects alter the covariance structure of the
model, building additional correlation into the model for sites sharing
a level of the random effect (here, sites within the same strata). More
formally, when incorporating a random effect, the spatial generalized
linear model becomes \[\begin{equation}\label{eq:spglm-rand}
f(\boldsymbol{\mu}) \equiv \mathbf{w} = \mathbf{X} \boldsymbol{\beta} +
\mathbf{Z}\mathbf{v} + \boldsymbol{\tau} + \boldsymbol{\epsilon},
\end{equation}\] where \(\mathbf{Z}\) is a design matrix that
indexes the random effects, \(\mathbf{v}\), and \(\text{Cov}(\mathbf{v}) =
\sigma^2_{v}\mathbf{I}\). Then the covariance matrix, \(\boldsymbol{\Sigma}\), becomes \(\sigma^2_v \mathbf{Z} \mathbf{Z}^\top +
\sigma^2_{de}\mathbf{R} + \sigma^2_{ie}\mathbf{I}\).
Previously we compared models using , but another way to compare models is by leave-one-out cross validation (Hastie et al. 2009). In leave-one-out cross validation, each observation is held-out and the model is re-fit and used to predict the held-out observation. Then a loss statistic is computed that compares each prediction to its true value. We calculate mean-squared-prediction error (MSPE) using leave-one-out cross validation for both the Poisson and negative binomial model by running
loocv(count_mod_pois)
#> # A tibble: 1 × 3
#> bias MSPE RMSPE
#> <dbl> <dbl> <dbl>
#> 1 1.37 32.0 5.66
loocv(count_mod_nb)
#> # A tibble: 1 × 3
#> bias MSPE RMSPE
#> <dbl> <dbl> <dbl>
#> 1 0.291 27.1 5.20
The negative binomial model has a lower MSPE, which suggests that incorporating overdispersion improves model fit here. Tidying this fitted model we see that there is some evidence elevation is related to moose counts and that this effect changes across strata (all \(p~\)values \(< 0.1\)):
tidy(count_mod_nb)
#> # A tibble: 3 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.301 1.50 -0.200 0.841
#> 2 elev 0.00590 0.00300 1.96 0.0494
#> 3 elev:stratM -0.00717 0.00408 -1.76 0.0789
The function in spmodel
apportions model variability
into fixed and random components:
varcomp(count_mod_nb)
#> # A tibble: 4 × 2
#> varcomp proportion
#> <chr> <dbl>
#> 1 Covariates (PR-sq) 0.0329
#> 2 de 0.498
#> 3 ie 0.00174
#> 4 1 | strat 0.467
Most model variability is explained by the spatially dependent random error () and the random effect for strata (). Running
shows observations of high influence or leverage and that the standardized residuals tend to be spread out around zero.
Estimated mean moose counts for the data (obtained via ) and predicted mean moose counts for the data (obtained via or ) share similiar patterns and are overlain by running
aug_mod <- augment(count_mod_nb)
aug_pred <- augment(count_mod_nb, newdata = moose_preds, type.predict = "response")
aug_mod$type <- "mod"
aug_pred$type <- "pred"
keep_cols <- c(".fitted", "type")
aug_combined <- rbind(aug_mod[, keep_cols], aug_pred[, keep_cols])
ggplot(aug_combined, aes(color = .fitted, shape = type)) +
geom_sf(size = 2) +
scale_color_viridis_c(option = "H") +
theme_gray(base_size = 18)
Throughout this vignette, we have shown several features
spmodel
offers, including a novel application of the
Laplace approximation, similarity to base R’s function,
over a dozen spatial covariance functions, a variety of tools available
to evaluate models, inspect model diagnostics, and make predictions
using ubiquitous base R functions (e.g., , , and ) and
more. Spatial generalized linear models for point-referenced data (i.e.,
generalized geostatistical models) are fit using the function. Spatial
generalized linear models for areal data (i.e., generalized spatial
autoregressive models) are fit using the function. Both functions share
common structure and syntax. Spatial data are simulated in
spmodel
by adding an prefix to commonly used base
R simulation functions (e.g., ).
We appreciate feedback from users regarding spmodel
. To
learn more about how to provide feedback or contribute to
spmodel
, please visit our GitHub repository at https://github.com/USEPA/spmodel.