Fit spatial generalized linear models for point-referenced data (i.e., generalized geostatistical models) using a variety of estimation methods, allowing for random effects, anisotropy, partition factors, and big data methods.
Usage
spglm(
formula,
family,
data,
spcov_type,
xcoord,
ycoord,
spcov_initial,
dispersion_initial,
estmethod = "reml",
anisotropy = FALSE,
random,
randcov_initial,
partition_factor,
local,
...
)
Arguments
- formula
A two-sided linear formula describing the fixed effect structure of the model, with the response to the left of the
~
operator and the terms on the right, separated by+
operators.- family
The generalized linear model family describing the distribution of the response variable to be used. Available options
"poisson"
,"nbinomial"
,"binomial"
,"beta"
,"Gamma"
, and"inverse.gaussian"
. Can be quoted or unquoted. Note that thefamily
argument only takes a single value, rather than the list structure used by stats::glm. See Details for more.- data
A data frame or
sf
object object that contains the variables infixed
,random
, andpartition_factor
as well as geographical information. If ansf
object is provided withPOINT
geometries, the x-coordinates and y-coordinates are used directly. If ansf
object is provided withPOLYGON
geometries, the x-coordinates and y-coordinates are taken as the centroids of each polygon.- spcov_type
The spatial covariance type. Available options include
"exponential"
,"spherical"
,"gaussian"
,"triangular"
,"circular"
,"cubic"
,"pentaspherical"
,"cosine"
,"wave"
,"jbessel"
,"gravity"
,"rquad"
,"magnetic"
,"matern"
,"cauchy"
,"pexponential"
, and"none"
. Parameterizations of each spatial covariance type are available in Details. Multiple spatial covariance types can be provided as a character vector, and thenspglm()
is called iteratively for each element and a list is returned for each model fit. The default forspcov_type
is"exponential"
. Whenspcov_type
is specified, all unknown spatial covariance parameters are estimated.spcov_type
is ignored ifspcov_initial
is provided.- xcoord
The name of the column in
data
representing the x-coordinate. Can be quoted or unquoted. Not required ifdata
is ansf
object.- ycoord
The name of the column in
data
representing the y-coordinate. Can be quoted or unquoted. Not required ifdata
is ansf
object.- spcov_initial
An object from
spcov_initial()
specifying initial and/or known values for the spatial covariance parameters. Multiplespcov_initial()
objects can be provided in a list. Thenspglm()
is called iteratively for each element and a list is returned for each model fit.- dispersion_initial
An object from
dispersion_initial()
specifying initial and/or known values for the dispersion parameter for the"nbinomial"
,"beta"
,"Gamma"
, and"inverse.gaussian"
families.family
is ignored ifdispersion_initial
is provided.- estmethod
The estimation method. Available options include
"reml"
for restricted maximum likelihood and"ml"
for maximum likelihood. The default is"reml"
.- anisotropy
A logical indicating whether (geometric) anisotropy should be modeled. Not required if
spcov_initial
is provided with 1)rotate
assumed unknown or assumed known and non-zero or 2)scale
assumed unknown or assumed known and less than one. Whenanisotropy
isTRUE
, computational times can significantly increase. The default isFALSE
.- random
A one-sided linear formula describing the random effect structure of the model. Terms are specified to the right of the
~ operator
. Each term has the structurex1 + ... + xn | g1/.../gm
, wherex1 + ... + xn
specifies the model for the random effects andg1/.../gm
is the grouping structure. Separate terms are separated by+
and must generally be wrapped in parentheses. Random intercepts are added to each model implicitly when at least one other variable is defined. If a random intercept is not desired, this must be explicitly defined (e.g.,x1 + ... + xn - 1 | g1/.../gm
). If only a random intercept is desired for a grouping structure, the random intercept must be specified as1 | g1/.../gm
. Note thatg1/.../gm
is shorthand for(1 | g1/.../gm)
. If only random intercepts are desired and the shorthand notation is used, parentheses can be omitted.- randcov_initial
An optional object specifying initial and/or known values for the random effect variances.
- partition_factor
A one-sided linear formula with a single term specifying the partition factor. The partition factor assumes observations from different levels of the partition factor are uncorrelated.
- local
An optional logical or list controlling the big data approximation. If omitted,
local
is set toTRUE
orFALSE
based on the sample size (the number of non-missing observations indata
) -- if the sample size exceeds 3,000,local
is set toTRUE
. Otherwise it is set toFALSE
. IfFALSE
, no big data approximation is implemented. If a list is provided, the following arguments detail the big data approximation:index:
The group indexes. Observations in different levels ofindex
are assumed to be uncorrelated for the purposes of estimation. Ifindex
is not provided, it is determined by specifyingmethod
and eithersize
orgroups
.method
: The big data approximation method used to determineindex
. Ignored ifindex
is provided. Ifmethod = "random"
, observations are randomly assigned toindex
based onsize
. Ifmethod = "kmeans"
, observations assigned toindex
based on k-means clustering on the coordinates withgroups
clusters. The default is"kmeans"
. Note that both methods have a random component, which means that you may get different results from separate model fitting calls. To ensure consistent results, specifyindex
or set a seed viabase::set.seed()
.size
: The number of observations in eachindex
group whenmethod
is"random"
. If the number of observations is not divisible bysize
, some levels getsize - 1
observations. The default is 100.groups:
The number ofindex
groups. Ifmethod
is"random"
,size
is \(ceiling(n / groups)\), where \(n\) is the sample size. Automatically determined ifsize
is specified. Ifmethod
is"kmeans"
,groups
is the number of clusters.var_adjust:
The approach for adjusting the variance-covariance matrix of the fixed effects."none"
for no adjustment,"theoretical"
for the theoretically-correct adjustment,"pooled"
for the pooled adjustment, and"empirical"
for the empirical adjustment. The default is"theoretical"
.parallel
: IfTRUE
, parallel processing via the parallel package is automatically used. The default isFALSE
.ncores
: Ifparallel = TRUE
, the number of cores to parallelize over. The default is the number of available cores on your machine.
When
local
is a list, at least one list element must be provided to initialize default arguments for the other list elements. Iflocal
isTRUE
, defaults forlocal
are chosen such thatlocal
is transformed intolist(size = 100, method = "kmeans", var_adjust = "theoretical", parallel = FALSE)
.- ...
Other arguments to
esv()
orstats::optim()
.
Value
A list with many elements that store information about
the fitted model object. If spcov_type
or spcov_initial
are
length one, the list has class spglm
. Many generic functions that
summarize model fit are available for spglm
objects, including
AIC
, AICc
, anova
, augment
, coef
,
cooks.distance
, covmatrix
, deviance
, fitted
, formula
,
glance
, glances
, hatvalues
, influence
,
labels
, logLik
, loocv
, model.frame
, model.matrix
,
plot
, predict
, print
, pseudoR2
, summary
,
terms
, tidy
, update
, varcomp
, and vcov
. If
spcov_type
or spcov_initial
are length greater than one, the
list has class spglm_list
and each element in the list has class
spglm
. glances
can be used to summarize spglm_list
objects, and the aforementioned spglm
generics can be used on each
individual list element (model fit).
Details
The spatial generalized linear model for point-referenced data (i.e., generalized geostatistical model) can be written as \(g(\mu) = \eta = X \beta + \tau + \epsilon\), where \(\mu\) is the expectation of the response (\(y\)) given the random errors, \(g(.)\) is called a link function which links together the \(\mu\) and \(\eta\), \(X\) is the fixed effects design matrix, \(\beta\) are the fixed effects, \(\tau\) is random error that is spatially dependent, and \(\epsilon\) is random error that is spatially independent.
There are six generalized linear model
families available: poisson
assumes \(y\) is a Poisson random variable
nbinomial
assumes \(y\) is a negative binomial random
variable, binomial
assumes \(y\) is a binomial random variable,
beta
assumes \(y\) is a beta random variable,
Gamma
assumes \(y\) is a gamma random
variable, and inverse.gaussian
assumes \(y\) is an inverse Gaussian
random variable.
The supports for \(y\) for each family are given below:
family: support of \(y\)
poisson: \(0 \le y\); \(y\) an integer
nbinomial: \(0 \le y\); \(y\) an integer
binomial: \(0 \le y\); \(y\) an integer
beta: \(0 < y < 1\)
Gamma: \(0 < y\)
inverse.gaussian: \(0 < y\)
The generalized linear model families and the parameterizations of their link functions are given below:
family: link function
poisson: \(g(\mu) = log(\eta)\) (log link)
nbinomial: \(g(\mu) = log(\eta)\) (log link)
binomial: \(g(\mu) = log(\eta / (1 - \eta))\) (logit link)
beta: \(g(\mu) = log(\eta / (1 - \eta))\) (logit link)
Gamma: \(g(\mu) = log(\eta)\) (log link)
inverse.gaussian: \(g(\mu) = log(\eta)\) (log link)
The variance function of an individual \(y\) (given \(\mu\)) for each generalized linear model family is given below:
family: \(Var(y)\)
poisson: \(\mu \phi\)
nbinomial: \(\mu + \mu^2 / \phi\)
binomial: \(n \mu (1 - \mu) \phi\)
beta: \(\mu (1 - \mu) / (1 + \phi)\)
Gamma: \(\mu^2 / \phi\)
inverse.gaussian: \(\mu^2 / \phi\)
The parameter \(\phi\) is a dispersion parameter that influences \(Var(y)\).
For the poisson
and binomial
families, \(\phi\) is always
one. Note that this inverse Gaussian parameterization is different than a
standard inverse Gaussian parameterization, which has variance \(\mu^3 / \lambda\).
Setting \(\phi = \lambda / \mu\) yields our parameterization, which is
preferred for computational stability. Also note that the dispersion parameter
is often defined in the literature as \(V(\mu) \phi\), where \(V(\mu)\) is the variance
function of the mean. We do not use this parameterization, which is important
to recognize while interpreting dispersion estimates.
For more on generalized linear model constructions, see McCullagh and
Nelder (1989).
Together, \(\tau\) and \(\epsilon\) are modeled using a spatial covariance function, expressed as \(de * R + ie * I\), where \(de\) is the dependent error variance, \(R\) is a correlation matrix that controls the spatial dependence structure among observations, \(ie\) is the independent error variance, and \(I\) is an identity matrix. Recall that \(\tau\) and \(\epsilon\) are modeled on the link scale, not the inverse link (response) scale. Random effects are also modeled on the link scale.
spcov_type
Details: Parametric forms for \(R\) are given below, where \(\eta = h / range\)
for \(h\) distance between observations:
exponential: \(exp(- \eta )\)
spherical: \((1 - 1.5\eta + 0.5\eta^3) * I(h <= range)\)
gaussian: \(exp(- \eta^2 )\)
triangular: \((1 - \eta) * I(h <= range)\)
circular: \((1 - (2 / \pi) * (m * sqrt(1 - m^2) + sin^{-1}(m))) * I(h <= range), m = min(\eta, 1)\)
cubic: \((1 - 7\eta^2 + 8.75\eta^3 - 3.5\eta^5 + 0.75\eta^7) * I(h <= range)\)
pentaspherical: \((1 - 1.875\eta + 1.25\eta^3 - 0.375\eta^5) * I(h <= range)\)
cosine: \(cos(\eta)\)
wave: \(sin(\eta) / \eta * I(h > 0) + I(h = 0)\)
jbessel: \(Bj(h * range)\), Bj is Bessel-J function
gravity: \((1 + \eta^2)^{-0.5}\)
rquad: \((1 + \eta^2)^{-1}\)
magnetic: \((1 + \eta^2)^{-1.5}\)
matern: \(2^{1 - extra}/ \Gamma(extra) * \alpha^{extra} * Bk(\alpha, extra)\), \(\alpha = (2extra * \eta)^{0.5}\), Bk is Bessel-K function with order \(1/5 \le extra \le 5\)
cauchy: \((1 + \eta^2)^{-extra}\), \(extra > 0\)
pexponential: \(exp(h^{extra}/range)\), \(0 < extra \le 2\)
none: \(0\)
All spatial covariance functions are valid in one spatial dimension. All
spatial covariance functions except triangular
and cosine
are
valid in two dimensions.
estmethod
Details: The various estimation methods are
reml
: Maximize the restricted log-likelihood.ml
: Maximize the log-likelihood.
Note that the likelihood being optimized is obtained using the Laplace approximation.
anisotropy
Details: By default, all spatial covariance parameters except rotate
and scale
as well as all random effect variance parameters
are assumed unknown, requiring estimation. If either rotate
or scale
are given initial values other than 0 and 1 (respectively) or are assumed unknown
in spcov_initial()
, anisotropy
is implicitly set to TRUE
.
(Geometric) Anisotropy is modeled by transforming a covariance function that
decays differently in different directions to one that decays equally in all
directions via rotation and scaling of the original coordinates. The rotation is
controlled by the rotate
parameter in \([0, \pi]\) radians. The scaling
is controlled by the scale
parameter in \([0, 1]\). The anisotropy
correction involves first a rotation of the coordinates clockwise by rotate
and then a
scaling of the coordinates' minor axis by the reciprocal of scale
. The spatial
covariance is then computed using these transformed coordinates.
random
Details: If random effects are used, the model
can be written as \(y = X \beta + Z1u1 + ... Zjuj + \tau + \epsilon\),
where each Z is a random effects design matrix and each u is a random effect.
partition_factor
Details: The partition factor can be represented in matrix form as \(P\), where
elements of \(P\) equal one for observations in the same level of the partition
factor and zero otherwise. The covariance matrix involving only the
spatial and random effects components is then multiplied element-wise
(Hadmard product) by \(P\), yielding the final covariance matrix.
local
Details: The big data approximation works by sorting observations into different levels
of an index variable. Observations in different levels of the index variable
are assumed to be uncorrelated for the purposes of model fitting. Sparse matrix methods are then implemented
for significant computational gains. Parallelization generally further speeds up
computations when data sizes are larger than a few thousand. Both the "random"
and "kmeans"
values of method
in local
have random components. That means you may get slightly different
results when using the big data approximation and rerunning spglm()
with the same code. For consistent results,
either set a seed via base::set.seed()
or specify index
to local
.
Observations with NA
response values are removed for model
fitting, but their values can be predicted afterwards by running
predict(object)
.
Note
This function does not perform any internal scaling. If optimization is not stable due to large extremely large variances, scale relevant variables so they have variance 1 before optimization.
References
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.
Examples
spgmod <- spglm(presence ~ elev,
family = "binomial", data = moose,
spcov_type = "exponential"
)
summary(spgmod)
#>
#> Call:
#> spglm(formula = presence ~ elev, family = "binomial", data = moose,
#> spcov_type = "exponential")
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.5249 -0.8114 0.5600 0.8306 1.5757
#>
#> Coefficients (fixed):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.874038 1.140953 -0.766 0.444
#> elev 0.002365 0.003184 0.743 0.458
#>
#> Pseudo R-squared: 0.00311
#>
#> Coefficients (exponential spatial covariance):
#> de ie range
#> 3.746e+00 4.392e-03 3.203e+04
#>
#> Coefficients (Dispersion for binomial family):
#> dispersion
#> 1