Fit spatial generalized autoregressive models

Fit spatial generalized linear models for areal data (i.e., spatial generalized autoregressive models) using a variety of estimation methods, allowing for random effects, partition factors, and row standardization.

spgautor(
  formula,
  family,
  data,
  spcov_type,
  spcov_initial,
  dispersion_initial,
  estmethod = "reml",
  random,
  randcov_initial,
  partition_factor,
  W,
  row_st = TRUE,
  M,
  range_positive = TRUE,
  cutoff,
  ...
)

Arguments

formula: A two-sided linear formula describing the fixed effect structure of the model, with the response to the left of the ~ operator and the terms, separated by + operators, on the right.
family: The generalized linear model family describing the distribution of the response variable to be used. Available options "poisson", "nbinomial", "binomial", "beta", "Gamma", and "inverse.gaussian". Can be quoted or unquoted. Note that the family argument only takes a single value, rather than the list structure used by stats::glm. See Details for more.
data: A data frame or sf object that contains the variables in fixed, random, and partition_factor, as well as potentially geographical information.
spcov_type: The spatial covariance type. Available options include "car", "sar", "none", and "ie". Parameterizations of each spatial covariance type are available in Details. When spcov_type is specified, relevant spatial covariance parameters are assumed unknown, requiring estimation. spcov_type is not required (and is ignored) if spcov_initial is provided. Multiple values can be provided in a character vector. Then spgautor() is called iteratively for each element and a list is returned for each model fit. The default for spcov_type is "car". When spcov_type is "none" or "ie", splm() is called.
spcov_initial: An object from spcov_initial() specifying initial and/or known values for the spatial covariance parameters. Not required if spcov_type is provided. Multiple spcov_initial() objects can be provided in a list. Then spgautor() is called iteratively for each element and a list is returned for each model fit.
dispersion_initial: An object from dispersion_initial() specifying initial and/or known values for the dispersion parameter for the "nbinomial", "beta", "Gamma", and "inverse.gaussian" families. family is ignored if dispersion_initial is provided.
estmethod: The estimation method. Available options include "reml" for restricted maximum likelihood and "ml" for maximum likelihood The default is "reml".
random: A one-sided linear formula describing the random effect structure of the model. Terms are specified to the right of the ~ operator. Each term has the structure x1 + ... + xn | g1/.../gm, where x1 + ... + xn specifies the model for the random effects and g1/.../gm is the grouping structure. Separate terms are separated by + and must generally be wrapped in parentheses. Random intercepts are added to each model implicitly when at least one other variable is defined. If a random intercept is not desired, this must be explicitly defined (e.g., x1 + ... + xn - 1 | g1/.../gm). If only a random intercept is desired for a grouping structure, the random intercept must be specified as 1 | g1/.../gm. Note that g1/.../gm is shorthand for (1 | g1/.../gm). If only random intercepts are desired and the shorthand notation is used, parentheses can be omitted.
randcov_initial: An optional object specifying initial and/or known values for the random effect variances.
partition_factor: A one-sided linear formula with a single term specifying the partition factor. The partition factor assumes observations from different levels of the partition factor are uncorrelated.
W: Weight matrix specifying the neighboring structure used. Not required if data is an sf object wtih POLYGON geometry, as W is calculated internally using queen contiguity. If calculated internally, W is computed using sf::st_intersects(). Also not required if data is an sf object with POINT geometry as long as cutoff is specified.
row_st: A logical indicating whether row standardization be performed on W. The default is TRUE.
M: M matrix satisfying the car symmetry condition. The car symmetry condition states that \((I - range * W)^{-1}M\) is symmetric, where \(I\) is an identity matrix, \(range\) is a constant that controls the spatial dependence, W is the weights matrix, and \(^{-1}\) represents the inverse operator. M is required for car models when W is provided and row_st is FALSE. When M, is required, the default is the identity matrix. M must be diagonal or given as a vector or one-column matrix assumed to be the diagonal.
range_positive: Whether the range should be constrained to be positive. The default is TRUE.
cutoff: The numeric, distance-based cutoff used to determine W when W is not specified. For an sf object with POINT geometry, two locations are considered neighbors if the distance between them is less than or equal to cutoff.
...: Other arguments to stats::optim().

Value

A list with many elements that store information about the fitted model object. If spcov_type or spcov_initial are length one, the list has class spgautor. Many generic functions that summarize model fit are available for spgautor objects, including AIC, AICc, anova, augment, AUROC, BIC, coef, cooks.distance, covmatrix, deviance, fitted, formula, glance, glances, hatvalues, influence, labels, logLik, loocv, model.frame, model.matrix, plot, predict, print, pseudoR2, summary, terms, tidy, update, varcomp, and vcov. If spcov_type or spcov_initial are length greater than one, the list has class spgautor_list and each element in the list has class spgautor. glances can be used to summarize spgautor_list

objects, and the aforementioned spgautor generics can be used on each individual list element (model fit).

Details

The spatial generalized linear model for areal data (i.e., spatial generalized autoregressive model) can be written as \(g(\mu) = \eta = X \beta + \tau + \epsilon\), where \(\mu\) is the expectation of the response (\(y\)) given the random errors, \(g(.)\) is called a link function which links together the \(\mu\) and \(\eta\), \(X\) is the fixed effects design matrix, \(\beta\) are the fixed effects, \(\tau\) is random error that is spatially dependent, and \(\epsilon\) is random error that is spatially independent.

There are six generalized linear model families available: poisson assumes \(y\) is a Poisson random variable nbinomial assumes \(y\) is a negative binomial random variable, binomial assumes \(y\) is a binomial random variable, beta assumes \(y\) is a beta random variable, Gamma assumes \(y\) is a gamma random variable, and inverse.gaussian assumes \(y\) is an inverse Gaussian random variable.

The supports for \(y\) for each family are given below:

family: support of \(y\)
poisson: \(0 \le y\); \(y\) an integer
nbinomial: \(0 \le y\); \(y\) an integer
binomial: \(0 \le y\); \(y\) an integer
beta: \(0 < y < 1\)
Gamma: \(0 < y\)
inverse.gaussian: \(0 < y\)

The generalized linear model families and the parameterizations of their link functions are given below:

family: link function
poisson: \(g(\mu) = log(\eta)\) (log link)
nbinomial: \(g(\mu) = log(\eta)\) (log link)
binomial: \(g(\mu) = log(\eta / (1 - \eta))\) (logit link)
beta: \(g(\mu) = log(\eta / (1 - \eta))\) (logit link)
Gamma: \(g(\mu) = log(\eta)\) (log link)
inverse.gaussian: \(g(\mu) = log(\eta)\) (log link)

The variance function of an individual \(y\) (given \(\mu\)) for each generalized linear model family is given below:

family: \(Var(y)\)
poisson: \(\mu \phi\)
nbinomial: \(\mu + \mu^2 / \phi\)
binomial: \(n \mu (1 - \mu) \phi\)
beta: \(\mu (1 - \mu) / (1 + \phi)\)
Gamma: \(\mu^2 / \phi\)
inverse.gaussian: \(\mu^2 / \phi\)

The parameter \(\phi\) is a dispersion parameter that influences \(Var(y)\). For the poisson and binomial families, \(\phi\) is always one. Note that this inverse Gaussian parameterization is different than a standard inverse Gaussian parameterization, which has variance \(\mu^3 / \lambda\). Setting \(\phi = \lambda / \mu\) yields our parameterization, which is preferred for computational stability. Also note that the dispersion parameter is often defined in the literature as \(V(\mu) \phi\), where \(V(\mu)\) is the variance function of the mean. We do not use this parameterization, which is important to recognize while interpreting dispersion parameter estimates. For more on generalized linear model constructions, see McCullagh and Nelder (1989).

Together, \(\tau\) and \(\epsilon\) are modeled using a spatial covariance function, expressed as \(de * R + ie * I\), where \(de\) is the dependent error variance, \(R\) is a matrix that controls the spatial dependence structure among observations, \(ie\) is the independent error variance, and \(I\) is an identity matrix. Note that \(de\) and \(ie\) must be non-negative while \(range\) must be between the reciprocal of the maximum eigenvalue of W and the reciprocal of the minimum eigenvalue of W. Recall that \(\tau\) and \(\epsilon\) are modeled on the link scale, not the inverse link (response) scale. Random effects are also modeled on the link scale.

spcov_type Details: Parametric forms for \(R\) are given below:

car: \((I - range * W)^{-1}M\), weights matrix \(W\), symmetry condition matrix \(M\)
sar: \([(I - range * W)(I - range * W)^T]^{-1}\), weights matrix \(W\), \(^T\) indicates matrix transpose
none: \(0\)
ie: \(0\) (see spglm() for differences compared to none)

If there are observations with no neighbors, they are given a unique variance parameter called extra, which must be non-negative.

estmethod Details: The various estimation methods are

reml: Maximize the restricted log-likelihood.
ml: Maximize the log-likelihood.

Note that the likelihood being optimized is obtained using the Laplace approximation.

By default, all spatial covariance parameters except ie as well as all random effect variance parameters are assumed unknown, requiring estimation. ie is assumed zero and known by default (in contrast to models fit using spglm(), where ie is assumed unknown by default). To change this default behavior, specify spcov_initial (an NA value for ie in spcov_initial to assume ie is unknown, requiring estimation).

random Details: If random effects are used, the model can be written as \(y = X \beta + Z1u1 + ... Zjuj + \tau + \epsilon\), where each Z is a random effects design matrix and each u is a random effect.

partition_factor Details: The partition factor can be represented in matrix form as \(P\), where elements of \(P\) equal one for observations in the same level of the partition factor and zero otherwise. The covariance matrix involving only the spatial and random effects components is then multiplied element-wise (Hadmard product) by \(P\), yielding the final covariance matrix.

Observations with NA response values are removed for model fitting, but their values can be predicted afterwards by running predict(object). This is the only way to perform prediction for spgautor() models (i.e., the prediction locations must be known prior to estimation).

Note

This function does not perform any internal scaling. If optimization is not stable due to large extremely large variances, scale relevant variables so they have variance 1 before optimization.

References

McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.

Examples

spgmod <- spgautor(I(log_trend^2) ~ 1, family = "Gamma", data = seal, spcov_type = "car")
summary(spgmod)
#> 
#> Call:
#> spgautor(formula = I(log_trend^2) ~ 1, family = "Gamma", data = seal, 
#>     spcov_type = "car")
#> 
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.5034 -2.1822 -1.2659 -0.3373  6.2996 
#> 
#> Coefficients (fixed):
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  -3.6675     0.1937  -18.94   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Coefficients (car spatial covariance):
#>      de   range   extra 
#> 0.00657 0.99331 0.00657 
#> 
#> Coefficients (Dispersion for Gamma family):
#> dispersion 
#>     0.3374