Fit spatial generalized linear models for point-referenced data (i.e., generalized geostatistical models) using a variety of estimation methods, allowing for random effects, anisotropy, partition factors, and big data methods.

spglm(
  formula,
  family,
  data,
  spcov_type,
  xcoord,
  ycoord,
  spcov_initial,
  dispersion_initial,
  estmethod = "reml",
  anisotropy = FALSE,
  random,
  randcov_initial,
  partition_factor,
  local,
  range_constrain,
  ...
)

Arguments

formula

A two-sided linear formula describing the fixed effect structure of the model, with the response to the left of the ~ operator and the terms on the right, separated by + operators.

family

The generalized linear model family describing the distribution of the response variable to be used. Available options "poisson", "nbinomial", "binomial", "beta", "Gamma", and "inverse.gaussian". Can be quoted or unquoted. Note that the family argument only takes a single value, rather than the list structure used by stats::glm. See Details for more.

data

A data frame or sf object object that contains the variables in fixed, random, and partition_factor as well as geographical information. If an sf object is provided with POINT geometries, the x-coordinates and y-coordinates are used directly. If an sf object is provided with POLYGON geometries, the x-coordinates and y-coordinates are taken as the centroids of each polygon.

spcov_type

The spatial covariance type. Available options include "exponential", "spherical", "gaussian", "triangular", "circular", "cubic", "pentaspherical", "cosine", "wave", "jbessel", "gravity", "rquad", "magnetic", "matern", "cauchy", "pexponential", "ie", and "none". Parameterizations of each spatial covariance type are available in Details. Multiple spatial covariance types can be provided as a character vector, and then spglm() is called iteratively for each element and a list is returned for each model fit. The default for spcov_type is "exponential". When spcov_type is specified, all unknown spatial covariance parameters are estimated. spcov_type is ignored if spcov_initial is provided.

xcoord

The name of the column in data representing the x-coordinate. Can be quoted or unquoted. Not required if data is an sf object.

ycoord

The name of the column in data representing the y-coordinate. Can be quoted or unquoted. Not required if data is an sf object.

spcov_initial

An object from spcov_initial() specifying initial and/or known values for the spatial covariance parameters. Multiple spcov_initial() objects can be provided in a list. Then spglm() is called iteratively for each element and a list is returned for each model fit.

dispersion_initial

An object from dispersion_initial() specifying initial and/or known values for the dispersion parameter for the "nbinomial", "beta", "Gamma", and "inverse.gaussian" families. family is ignored if dispersion_initial is provided.

estmethod

The estimation method. Available options include "reml" for restricted maximum likelihood and "ml" for maximum likelihood. The default is "reml".

anisotropy

A logical indicating whether (geometric) anisotropy should be modeled. Not required if spcov_initial is provided with 1) rotate assumed unknown or assumed known and non-zero or 2) scale assumed unknown or assumed known and less than one. When anisotropy is TRUE, computational times can significantly increase. The default is FALSE.

random

A one-sided linear formula describing the random effect structure of the model. Terms are specified to the right of the ~ operator. Each term has the structure x1 + ... + xn | g1/.../gm, where x1 + ... + xn specifies the model for the random effects and g1/.../gm is the grouping structure. Separate terms are separated by + and must generally be wrapped in parentheses. Random intercepts are added to each model implicitly when at least one other variable is defined. If a random intercept is not desired, this must be explicitly defined (e.g., x1 + ... + xn - 1 | g1/.../gm). If only a random intercept is desired for a grouping structure, the random intercept must be specified as 1 | g1/.../gm. Note that g1/.../gm is shorthand for (1 | g1/.../gm). If only random intercepts are desired and the shorthand notation is used, parentheses can be omitted.

randcov_initial

An optional object specifying initial and/or known values for the random effect variances.

partition_factor

A one-sided linear formula with a single term specifying the partition factor. The partition factor assumes observations from different levels of the partition factor are uncorrelated.

local

An optional logical or list controlling the big data approximation. If omitted, local is set to TRUE or FALSE based on the sample size (the number of non-missing observations in data) -- if the sample size exceeds 3,000, local is set to TRUE. Otherwise it is set to FALSE. If FALSE, no big data approximation is implemented. If a list is provided, the following arguments detail the big data approximation:

  • index: The group indexes. Observations in different levels of index are assumed to be uncorrelated for the purposes of estimation. If index is not provided, it is determined by specifying method and either size or groups.

  • method: The big data approximation method used to determine index. Ignored if index is provided. If method = "random", observations are randomly assigned to index based on size. If method = "kmeans", observations assigned to index based on k-means clustering on the coordinates with groups clusters. The default is "kmeans". Note that both methods have a random component, which means that you may get different results from separate model fitting calls. To ensure consistent results, specify index or set a seed via base::set.seed().

  • size: The number of observations in each index group when method is "random". If the number of observations is not divisible by size, some levels get size - 1 observations. The default is 100.

  • groups: The number of index groups. If method is "random", size is \(ceiling(n / groups)\), where \(n\) is the sample size. Automatically determined if size is specified. If method is "kmeans", groups is the number of clusters.

  • var_adjust: The approach for adjusting the variance-covariance matrix of the fixed effects. "none" for no adjustment, "theoretical" for the theoretically-correct adjustment, "pooled" for the pooled adjustment, and "empirical" for the empirical adjustment. The default is "theoretical" for samples sizes up to 100,000 and "none" for samples sizes exceeding 100,000.

  • parallel: If TRUE, parallel processing via the parallel package is automatically used. The default is FALSE.

  • ncores: If parallel = TRUE, the number of cores to parallelize over. The default is the number of available cores on your machine.

When local is a list, at least one list element must be provided to initialize default arguments for the other list elements. If local is TRUE, defaults for local are chosen such that local is transformed into list(size = 100, method = "kmeans", var_adjust = "theoretical", parallel = FALSE).

range_constrain

An optional logical that indicates whether the range should be constrained to enhance numerical stability. If range_constrain = TRUE, the maximum possible range value is 4 times the maximum distance in the domain. If range_constrain = FALSE, then maximum possible range is unbounded. The default is FALSE. Note that if range_constrain = TRUE and the value of range in spcov_initial is larger than range_constrain, then range_constrain is set to FALSE.

...

Other arguments to esv() or stats::optim().

Value

A list with many elements that store information about the fitted model object. If spcov_type or spcov_initial are length one, the list has class spglm. Many generic functions that summarize model fit are available for spglm objects, including AIC, AICc, anova, augment, AUROC, BIC, coef, cooks.distance, covmatrix, deviance, fitted, formula, glance, glances, hatvalues, influence, labels, logLik, loocv, model.frame, model.matrix, plot, predict, print, pseudoR2, summary, terms, tidy, update, varcomp, and vcov. If spcov_type or spcov_initial are length greater than one, the list has class spglm_list and each element in the list has class spglm. glances can be used to summarize spglm_list

objects, and the aforementioned spglm generics can be used on each individual list element (model fit).

Details

The spatial generalized linear model for point-referenced data (i.e., generalized geostatistical model) can be written as \(g(\mu) = \eta = X \beta + \tau + \epsilon\), where \(\mu\) is the expectation of the response (\(y\)) given the random errors, \(g(.)\) is called a link function which links together the \(\mu\) and \(\eta\), \(X\) is the fixed effects design matrix, \(\beta\) are the fixed effects, \(\tau\) is random error that is spatially dependent, and \(\epsilon\) is random error that is spatially independent.

There are six generalized linear model families available: poisson assumes \(y\) is a Poisson random variable nbinomial assumes \(y\) is a negative binomial random variable, binomial assumes \(y\) is a binomial random variable, beta assumes \(y\) is a beta random variable, Gamma assumes \(y\) is a gamma random variable, and inverse.gaussian assumes \(y\) is an inverse Gaussian random variable.

The supports for \(y\) for each family are given below:

  • family: support of \(y\)

  • poisson: \(0 \le y\); \(y\) an integer

  • nbinomial: \(0 \le y\); \(y\) an integer

  • binomial: \(0 \le y\); \(y\) an integer

  • beta: \(0 < y < 1\)

  • Gamma: \(0 < y\)

  • inverse.gaussian: \(0 < y\)

The generalized linear model families and the parameterizations of their link functions are given below:

  • family: link function

  • poisson: \(g(\mu) = log(\eta)\) (log link)

  • nbinomial: \(g(\mu) = log(\eta)\) (log link)

  • binomial: \(g(\mu) = log(\eta / (1 - \eta))\) (logit link)

  • beta: \(g(\mu) = log(\eta / (1 - \eta))\) (logit link)

  • Gamma: \(g(\mu) = log(\eta)\) (log link)

  • inverse.gaussian: \(g(\mu) = log(\eta)\) (log link)

The variance function of an individual \(y\) (given \(\mu\)) for each generalized linear model family is given below:

  • family: \(Var(y)\)

  • poisson: \(\mu \phi\)

  • nbinomial: \(\mu + \mu^2 / \phi\)

  • binomial: \(n \mu (1 - \mu) \phi\)

  • beta: \(\mu (1 - \mu) / (1 + \phi)\)

  • Gamma: \(\mu^2 / \phi\)

  • inverse.gaussian: \(\mu^2 / \phi\)

The parameter \(\phi\) is a dispersion parameter that influences \(Var(y)\). For the poisson and binomial families, \(\phi\) is always one. Note that this inverse Gaussian parameterization is different than a standard inverse Gaussian parameterization, which has variance \(\mu^3 / \lambda\). Setting \(\phi = \lambda / \mu\) yields our parameterization, which is preferred for computational stability. Also note that the dispersion parameter is often defined in the literature as \(V(\mu) \phi\), where \(V(\mu)\) is the variance function of the mean. We do not use this parameterization, which is important to recognize while interpreting dispersion estimates. For more on generalized linear model constructions, see McCullagh and Nelder (1989).

Together, \(\tau\) and \(\epsilon\) are modeled using a spatial covariance function, expressed as \(de * R + ie * I\), where \(de\) is the dependent error variance, \(R\) is a correlation matrix that controls the spatial dependence structure among observations, \(ie\) is the independent error variance, and \(I\) is an identity matrix. Recall that \(\tau\) and \(\epsilon\) are modeled on the link scale, not the inverse link (response) scale. Random effects are also modeled on the link scale.

spcov_type Details: Parametric forms for \(R\) are given below, where \(\eta = h / range\) for \(h\) distance between observations:

  • exponential: \(exp(- \eta )\)

  • spherical: \((1 - 1.5\eta + 0.5\eta^3) * I(h <= range)\)

  • gaussian: \(exp(- \eta^2 )\)

  • triangular: \((1 - \eta) * I(h <= range)\)

  • circular: \((1 - (2 / \pi) * (m * sqrt(1 - m^2) + sin^{-1}(m))) * I(h <= range), m = min(\eta, 1)\)

  • cubic: \((1 - 7\eta^2 + 8.75\eta^3 - 3.5\eta^5 + 0.75\eta^7) * I(h <= range)\)

  • pentaspherical: \((1 - 1.875\eta + 1.25\eta^3 - 0.375\eta^5) * I(h <= range)\)

  • cosine: \(cos(\eta)\)

  • wave: \(sin(\eta) / \eta * I(h > 0) + I(h = 0)\)

  • jbessel: \(Bj(h * range)\), Bj is Bessel-J function

  • gravity: \((1 + \eta^2)^{-0.5}\)

  • rquad: \((1 + \eta^2)^{-1}\)

  • magnetic: \((1 + \eta^2)^{-1.5}\)

  • matern: \(2^{1 - extra}/ \Gamma(extra) * \alpha^{extra} * Bk(\alpha, extra)\), \(\alpha = (2extra * \eta)^{0.5}\), Bk is Bessel-K function with order \(1/5 \le extra \le 5\)

  • cauchy: \((1 + \eta^2)^{-extra}\), \(extra > 0\)

  • pexponential: \(exp(h^{extra}/range)\), \(0 < extra \le 2\)

  • none: \(0\)

  • ie: \(0\) (see below for differences compared to none)

There is a slight difference between none and ie (the spcov_type) from above. none fixes the "ie" covariance parameter at zero, which should yield a model that has very similar parameter estimates as one from stats::glm. Note that spglm() uses a different likelihood, so direct likelihood-based comparisons (e.g., AIC) should not be made to compare an spglm() model to a glm() one (though deviance can be used). ie allows the "ie" covariance parameter to vary. If the "ie" covariance parameter is estimated near zero, then none and ie yield similar models.

All spatial covariance functions are valid in one spatial dimension. All spatial covariance functions except triangular and cosine are valid in two dimensions.

estmethod Details: The various estimation methods are

  • reml: Maximize the restricted log-likelihood.

  • ml: Maximize the log-likelihood.

Note that the likelihood being optimized is obtained using the Laplace approximation.

anisotropy Details: By default, all spatial covariance parameters except rotate and scale as well as all random effect variance parameters are assumed unknown, requiring estimation. If either rotate or scale are given initial values other than 0 and 1 (respectively) or are assumed unknown in spcov_initial(), anisotropy is implicitly set to TRUE. (Geometric) Anisotropy is modeled by transforming a covariance function that decays differently in different directions to one that decays equally in all directions via rotation and scaling of the original coordinates. The rotation is controlled by the rotate parameter in \([0, \pi]\) radians. The scaling is controlled by the scale parameter in \([0, 1]\). The anisotropy correction involves first a rotation of the coordinates clockwise by rotate and then a scaling of the coordinates' minor axis by the reciprocal of scale. The spatial covariance is then computed using these transformed coordinates.

random Details: If random effects are used, the model can be written as \(y = X \beta + Z1u1 + ... Zjuj + \tau + \epsilon\), where each Z is a random effects design matrix and each u is a random effect.

partition_factor Details: The partition factor can be represented in matrix form as \(P\), where elements of \(P\) equal one for observations in the same level of the partition factor and zero otherwise. The covariance matrix involving only the spatial and random effects components is then multiplied element-wise (Hadmard product) by \(P\), yielding the final covariance matrix.

local Details: The big data approximation works by sorting observations into different levels of an index variable. Observations in different levels of the index variable are assumed to be uncorrelated for the purposes of model fitting. Sparse matrix methods are then implemented for significant computational gains. Parallelization generally further speeds up computations when data sizes are larger than a few thousand. Both the "random" and "kmeans" values of method in local have random components. That means you may get slightly different results when using the big data approximation and rerunning spglm() with the same code. For consistent results, either set a seed via base::set.seed() or specify index to local.

Observations with NA response values are removed for model fitting, but their values can be predicted afterwards by running predict(object).

Note

This function does not perform any internal scaling. If optimization is not stable due to large extremely large variances, scale relevant variables so they have variance 1 before optimization.

References

McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.

Examples

spgmod <- spglm(presence ~ elev,
  family = "binomial", data = moose,
  spcov_type = "exponential"
)
summary(spgmod)
#> 
#> Call:
#> spglm(formula = presence ~ elev, family = "binomial", data = moose, 
#>     spcov_type = "exponential")
#> 
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.5249 -0.8114  0.5600  0.8306  1.5757 
#> 
#> Coefficients (fixed):
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.874038   1.140953  -0.766    0.444
#> elev         0.002365   0.003184   0.743    0.458
#> 
#> Pseudo R-squared: 0.00311
#> 
#> Coefficients (exponential spatial covariance):
#>        de        ie     range 
#> 3.746e+00 4.392e-03 3.203e+04 
#> 
#> Coefficients (Dispersion for binomial family):
#> dispersion 
#>          1