Fit spatial linear models for point-referenced data (i.e., geostatistical models) using a variety of estimation methods, allowing for random effects, anisotropy, partition factors, and big data methods.
splm(
formula,
data,
spcov_type,
xcoord,
ycoord,
spcov_initial,
estmethod = "reml",
weights = "cressie",
anisotropy = FALSE,
random,
randcov_initial,
partition_factor,
local,
range_constrain,
...
)
formula: A two-sided linear formula describing the fixed effect structure of the model, with the response to the left of the ~ operator and the terms on the right, separated by + operators.
data: A data frame or sf object that contains the variables in fixed, random, and partition_factor, as well as geographical information. If an sf object with POINT geometries is provided, the x-coordinates and y-coordinates are used directly. If an sf object with POLYGON geometries is provided, the x-coordinates and y-coordinates are taken as the centroids of each polygon.
spcov_type: The spatial covariance type. Available options include "exponential", "spherical", "gaussian", "triangular", "circular", "cubic", "pentaspherical", "cosine", "wave", "jbessel", "gravity", "rquad", "magnetic", "matern", "cauchy", "pexponential", and "none". Parameterizations of each spatial covariance type are available in Details. Multiple spatial covariance types can be provided as a character vector, in which case splm() is called iteratively for each element and a list of model fits is returned. The default for spcov_type is "exponential". When spcov_type is specified, all unknown spatial covariance parameters are estimated. spcov_type is ignored if spcov_initial is provided.
xcoord: The name of the column in data representing the x-coordinate. Can be quoted or unquoted. Not required if data is an sf object.
ycoord: The name of the column in data representing the y-coordinate. Can be quoted or unquoted. Not required if data is an sf object.
spcov_initial: An object from spcov_initial() specifying initial and/or known values for the spatial covariance parameters. Multiple spcov_initial() objects can be provided in a list, in which case splm() is called iteratively for each element and a list of model fits is returned.
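A minimal sketch of using spcov_initial() to hold a covariance parameter at a known value, based on the caribou example from this page's own Examples section (the ie value here is arbitrary, chosen only for illustration):

```r
library(spmodel)

# Sketch: hold the independent error variance (ie) fixed at a known value
# while estimating the remaining exponential covariance parameters.
# The value 0.02 is an arbitrary illustration, not a recommendation.
init <- spcov_initial("exponential", ie = 0.02, known = "ie")
spmod <- splm(z ~ water + tarp, data = caribou,
              spcov_initial = init, xcoord = x, ycoord = y)
coef(spmod, type = "spcov")  # ie remains fixed at its known value
```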
estmethod: The estimation method. Available options include "reml" for restricted maximum likelihood, "ml" for maximum likelihood, "sv-wls" for semivariogram weighted least squares, and "sv-cl" for semivariogram composite likelihood. The default is "reml".
weights: Weights to use when estmethod is "sv-wls". Available options include "cressie", "cressie-dr", "cressie-nopairs", "cressie-dr-nopairs", "pairs", "pairs-invd", "pairs-invrd", and "ols". Parameterizations for each weight are available in Details. The default is "cressie".
anisotropy: A logical indicating whether (geometric) anisotropy should be modeled. Not required if spcov_initial is provided with 1) rotate assumed unknown or assumed known and non-zero, or 2) scale assumed unknown or assumed known and less than one. When anisotropy is TRUE, computational times can increase significantly. The default is FALSE.
random: A one-sided linear formula describing the random effect structure of the model. Terms are specified to the right of the ~ operator. Each term has the structure x1 + ... + xn | g1/.../gm, where x1 + ... + xn specifies the model for the random effects and g1/.../gm is the grouping structure. Separate terms are separated by + and must generally be wrapped in parentheses. Random intercepts are added to each model implicitly when at least one other variable is defined. If a random intercept is not desired, this must be explicitly stated (e.g., x1 + ... + xn - 1 | g1/.../gm). If only a random intercept is desired for a grouping structure, it must be specified as 1 | g1/.../gm. Note that g1/.../gm is shorthand for (1 | g1/.../gm). If only random intercepts are desired and the shorthand notation is used, parentheses can be omitted.
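A sketch of the random effect syntax. The data frame my_data and its pen grouping column are hypothetical, used only to illustrate the formula:

```r
library(spmodel)

# Sketch: `my_data` and the grouping variable `pen` are hypothetical
# names, shown only to illustrate the random formula syntax.
# A random intercept for each level of pen, via the shorthand notation:
spmod <- splm(z ~ water + tarp, data = my_data,
              spcov_type = "exponential", xcoord = x, ycoord = y,
              random = ~ pen)
# Equivalent explicit form: random = ~ (1 | pen)
```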
randcov_initial: An optional object specifying initial and/or known values for the random effect variances.
partition_factor: A one-sided linear formula with a single term specifying the partition factor. The partition factor assumes observations from different levels of the partition factor are uncorrelated.
local: An optional logical or list controlling the big data approximation. If omitted, local is set to TRUE or FALSE based on the sample size (the number of non-missing observations in data): if the sample size exceeds 5,000, local is set to TRUE; otherwise it is set to FALSE. local is also set to FALSE when spcov_type is "none" and there are no random effects specified via random. If FALSE, no big data approximation is implemented. If a list is provided, the following arguments detail the big data approximation:

- index: The group indexes. Observations in different levels of index are assumed to be uncorrelated for the purposes of estimation. If index is not provided, it is determined by specifying method and either size or groups.
- method: The big data approximation method used to determine index. Ignored if index is provided. If method = "random", observations are randomly assigned to index based on size. If method = "kmeans", observations are assigned to index based on k-means clustering on the coordinates with groups clusters. The default is "kmeans". Note that both methods have a random component, which means that you may get different results from separate model fitting calls. To ensure consistent results, specify index or set a seed via base::set.seed().
- size: The number of observations in each index group when method is "random". If the number of observations is not divisible by size, some levels get size - 1 observations. The default is 100.
- groups: The number of index groups. If method is "random", size is \(ceiling(n / groups)\), where \(n\) is the sample size, and groups is automatically determined if size is specified. If method is "kmeans", groups is the number of clusters.
- var_adjust: The approach for adjusting the variance-covariance matrix of the fixed effects: "none" for no adjustment, "theoretical" for the theoretically-correct adjustment, "pooled" for the pooled adjustment, and "empirical" for the empirical adjustment. The default is "theoretical" for sample sizes up to 100,000 and "none" for sample sizes exceeding 100,000.
- parallel: If TRUE, parallel processing via the parallel package is automatically used. The default is FALSE.
- ncores: If parallel = TRUE, the number of cores to parallelize over. The default is the number of available cores on your machine.

When local is a list, at least one list element must be provided to initialize default arguments for the other list elements. If local is TRUE, defaults are chosen such that local is transformed into list(size = 100, method = "kmeans", var_adjust = "theoretical", parallel = FALSE).
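A sketch of an explicit big data specification. The data frame big_df, its response and predictors, and the coordinate columns xc and yc are hypothetical names:

```r
library(spmodel)

# Sketch: `big_df`, `y`, `x1`, `x2`, `xc`, and `yc` are hypothetical names
# for a large point-referenced data set.
# 60 k-means groups on the coordinates, theoretical variance adjustment,
# no parallelization:
spmod <- splm(y ~ x1 + x2, data = big_df,
              spcov_type = "exponential", xcoord = xc, ycoord = yc,
              local = list(method = "kmeans", groups = 60,
                           var_adjust = "theoretical", parallel = FALSE))
```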
range_constrain: An optional logical that indicates whether the range should be constrained to enhance numerical stability. If range_constrain = TRUE, the maximum possible range value is 5 times the maximum distance in the domain. If range_constrain = FALSE, the maximum possible range is unbounded. The default is FALSE. Note that if range_constrain = TRUE and the value of range in spcov_initial is larger than the constraint, then range_constrain is set to FALSE.
...: Other arguments to esv() or stats::optim().
Value: A list with many elements that store information about the fitted model object. If spcov_type or spcov_initial are length one, the list has class splm. Many generic functions that summarize model fit are available for splm objects, including AIC, AICc, anova, augment, BIC, coef, cooks.distance, covmatrix, deviance, fitted, formula, glance, glances, hatvalues, influence, labels, logLik, loocv, model.frame, model.matrix, plot, predict, print, pseudoR2, summary, terms, tidy, update, varcomp, and vcov. If spcov_type or spcov_initial are length greater than one, the list has class splm_list and each element in the list has class splm. glances can be used to summarize splm_list objects, and the aforementioned splm generics can be used on each individual list element (model fit).
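A sketch of fitting multiple covariance types in one call, using the caribou data from this page's Examples section:

```r
library(spmodel)

# Sketch: two candidate covariance types in one call return an splm_list;
# glances() compares the fits, and the usual generics apply to each element.
spmods <- splm(z ~ water + tarp, data = caribou,
               spcov_type = c("exponential", "matern"),
               xcoord = x, ycoord = y)
glances(spmods)
AIC(spmods[[1]])  # generics apply to each individual list element
```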
The spatial linear model for point-referenced data (i.e., geostatistical model) can be written as \(y = X \beta + \tau + \epsilon\), where \(X\) is the fixed effects design matrix, \(\beta\) are the fixed effects, \(\tau\) is random error that is spatially dependent, and \(\epsilon\) is random error that is spatially independent. Together, \(\tau\) and \(\epsilon\) are modeled using a spatial covariance function, expressed as \(de * R + ie * I\), where \(de\) is the dependent error variance, \(R\) is a correlation matrix that controls the spatial dependence structure among observations, \(ie\) is the independent error variance, and \(I\) is an identity matrix.
spcov_type Details: Parametric forms for \(R\) are given below, where \(\eta = h / range\) for \(h\) the distance between observations:
exponential: \(exp(- \eta )\)
spherical: \((1 - 1.5\eta + 0.5\eta^3) * I(h <= range)\)
gaussian: \(exp(- \eta^2 )\)
triangular: \((1 - \eta) * I(h <= range)\)
circular: \((1 - (2 / \pi) * (m * sqrt(1 - m^2) + sin^{-1}(m))) * I(h <= range), m = min(\eta, 1)\)
cubic: \((1 - 7\eta^2 + 8.75\eta^3 - 3.5\eta^5 + 0.75\eta^7) * I(h <= range)\)
pentaspherical: \((1 - 1.875\eta + 1.25\eta^3 - 0.375\eta^5) * I(h <= range)\)
cosine: \(cos(\eta)\)
wave: \(sin(\eta) / \eta * I(h > 0) + I(h = 0)\)
jbessel: \(Bj(h * range)\), Bj is Bessel-J function
gravity: \((1 + \eta^2)^{-0.5}\)
rquad: \((1 + \eta^2)^{-1}\)
magnetic: \((1 + \eta^2)^{-1.5}\)
matern: \(2^{1 - extra}/ \Gamma(extra) * \alpha^{extra} * Bk(\alpha, extra)\), \(\alpha = (2extra * \eta)^{0.5}\), Bk is Bessel-K function with order \(1/5 \le extra \le 5\)
cauchy: \((1 + \eta^2)^{-extra}\), \(extra > 0\)
pexponential: \(exp(-h^{extra}/range)\), \(0 < extra \le 2\)
none: \(0\)
All spatial covariance functions are valid in one spatial dimension. All spatial covariance functions except triangular and cosine are valid in two dimensions.
estmethod Details: The various estimation methods are:

- reml: Maximize the restricted log-likelihood.
- ml: Maximize the log-likelihood.
- sv-wls: Minimize the semivariogram weighted least squares loss.
- sv-cl: Minimize the semivariogram composite likelihood loss.
anisotropy Details: By default, all spatial covariance parameters except rotate and scale, as well as all random effect variance parameters, are assumed unknown, requiring estimation. If either rotate or scale are given initial values other than 0 and 1 (respectively) or are assumed unknown in spcov_initial(), anisotropy is implicitly set to TRUE.
(Geometric) anisotropy is modeled by transforming a covariance function that decays differently in different directions to one that decays equally in all directions via rotation and scaling of the original coordinates. The rotation is controlled by the rotate parameter in \([0, \pi]\) radians. The scaling is controlled by the scale parameter in \([0, 1]\). The anisotropy correction involves first a rotation of the coordinates clockwise by rotate and then a scaling of the coordinates' minor axis by the reciprocal of scale. The spatial covariance is then computed using these transformed coordinates.
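A sketch of two ways to request anisotropy, using the caribou data from this page's Examples section (the rotate and scale initial values are arbitrary):

```r
library(spmodel)

# Sketch: two equivalent routes to geometric anisotropy.
# 1) Set anisotropy = TRUE so rotate and scale are estimated:
spmod1 <- splm(z ~ water + tarp, data = caribou,
               spcov_type = "exponential", xcoord = x, ycoord = y,
               anisotropy = TRUE)
# 2) Give rotate and scale non-default initial values (0.5 is arbitrary)
#    in spcov_initial(), which implicitly sets anisotropy = TRUE:
init <- spcov_initial("exponential", rotate = 0.5, scale = 0.5)
spmod2 <- splm(z ~ water + tarp, data = caribou,
               spcov_initial = init, xcoord = x, ycoord = y)
```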
random Details: If random effects are used (the estimation method must be "reml" or "ml"), the model can be written as \(y = X \beta + Z_1 u_1 + ... + Z_j u_j + \tau + \epsilon\), where each \(Z\) is a random effects design matrix and each \(u\) is a random effect.
partition_factor Details: The partition factor can be represented in matrix form as \(P\), where elements of \(P\) equal one for observations in the same level of the partition factor and zero otherwise. The covariance matrix involving only the spatial and random effects components is then multiplied element-wise (Hadamard product) by \(P\), yielding the final covariance matrix.
local Details: The big data approximation works by sorting observations into different levels of an index variable. Observations in different levels of the index variable are assumed to be uncorrelated for the purposes of model fitting. Sparse matrix methods are then implemented for significant computational gains. Parallelization generally further speeds up computations when data sizes are larger than a few thousand. Both the "random" and "kmeans" values of method in local have random components, which means you may get slightly different results when rerunning splm() with the big data approximation and the same code. For consistent results, either set a seed via base::set.seed() or specify index to local.
Observations with NA response values are removed for model fitting, but their values can be predicted afterwards by running predict(object).
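A sketch of this behavior, using the caribou data from this page's Examples section with a few responses artificially set to missing:

```r
library(spmodel)

# Sketch: rows with missing response are dropped during fitting but can
# be predicted afterwards from the fitted object.
caribou_mis <- caribou
caribou_mis$z[1:3] <- NA  # introduce missing responses for illustration
spmod <- splm(z ~ water + tarp, data = caribou_mis,
              spcov_type = "exponential", xcoord = x, ycoord = y)
predict(spmod)  # predictions at the locations with missing z
```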
This function does not perform any internal scaling. If optimization is not stable due to extremely large variances, scale relevant variables so they have variance 1 before optimization.
spmod <- splm(z ~ water + tarp,
data = caribou,
spcov_type = "exponential", xcoord = x, ycoord = y
)
summary(spmod)
#>
#> Call:
#> splm(formula = z ~ water + tarp, data = caribou, spcov_type = "exponential",
#> xcoord = x, ycoord = y)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.41281 -0.20763 -0.11205 0.02956 0.45429
#>
#> Coefficients (fixed):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.04981 0.31093 6.592 4.33e-11 ***
#> waterY -0.08310 0.06449 -1.289 0.197563
#> tarpnone 0.08005 0.07759 1.032 0.302166
#> tarpshade 0.28654 0.07667 3.737 0.000186 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Pseudo R-squared: 0.3963
#>
#> Coefficients (exponential spatial covariance):
#> de ie range
#> 0.1109 0.0226 19.1168