
Introduction

This vignette covers technical details regarding the functions in spmodel that perform computations. We first provide a notation guide and then describe relevant details for each function.

If you use spmodel in a formal publication or report, please cite it. Citing spmodel lets us devote more resources to it in the future. To view the spmodel citation, run

citation(package = "spmodel")
#> To cite spmodel in publications use:
#> 
#>   Dumelle M, Higham M, Ver Hoef JM (2023). spmodel: Spatial statistical
#>   modeling and prediction in R. PLOS ONE 18(3): e0282524.
#>   https://doi.org/10.1371/journal.pone.0282524
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {{spmodel}: Spatial statistical modeling and prediction in {R}},
#>     author = {Michael Dumelle and Matt Higham and Jay M. {Ver Hoef}},
#>     journal = {PLOS ONE},
#>     year = {2023},
#>     volume = {18},
#>     number = {3},
#>     pages = {1--32},
#>     doi = {10.1371/journal.pone.0282524},
#>     url = {https://doi.org/10.1371/journal.pone.0282524},
#>   }

Notation Guide

\begin{equation*}
\begin{split}
n & = \text{Sample size} \\
\mathbf{y} & = \text{Response vector} \\
\boldsymbol{\beta} & = \text{Fixed effect parameter vector} \\
\mathbf{X} & = \text{Design matrix of known explanatory variables (covariates)} \\
p & = \text{The number of linearly independent columns in } \mathbf{X} \\
\boldsymbol{\mu} & = \text{Mean vector} \\
\mathbf{w} & = \text{Latent generalized linear model mean on the link scale} \\
\varphi & = \text{Dispersion parameter} \\
\mathbf{Z} & = \text{Design matrix of known random effect variables} \\
\boldsymbol{\theta} & = \text{Covariance parameter vector} \\
\boldsymbol{\Sigma} & = \text{Covariance matrix evaluated at } \boldsymbol{\theta} \\
\boldsymbol{\Sigma}^{-1} & = \text{The inverse of } \boldsymbol{\Sigma} \\
\boldsymbol{\Sigma}^{1/2} & = \text{The square root of } \boldsymbol{\Sigma} \\
\boldsymbol{\Sigma}^{-1/2} & = \text{The inverse of } \boldsymbol{\Sigma}^{1/2} \\
\boldsymbol{\Theta} & = \text{General parameter vector} \\
\ell(\boldsymbol{\Theta}) & = \text{Log-likelihood evaluated at } \boldsymbol{\Theta} \\
\boldsymbol{\tau} & = \text{Spatial (dependent) random error} \\
\boldsymbol{\epsilon} & = \text{Independent (non-spatial) random error} \\
\mathbf{A}^* & = \boldsymbol{\Sigma}^{-1/2}\mathbf{A} \text{ for a general matrix } \mathbf{A} \text{ (this is known as whitening $\mathbf{A}$)}
\end{split}
\end{equation*}

A hat indicates the parameters are estimated (i.e., $\hat{\boldsymbol{\beta}}$) or evaluated at a relevant estimated parameter vector (e.g., $\hat{\boldsymbol{\Sigma}}$ is evaluated at $\hat{\boldsymbol{\theta}}$). When $\ell(\boldsymbol{\hat{\Theta}})$ is written, it means the log-likelihood evaluated at its maximum, $\boldsymbol{\hat{\Theta}}$. When the covariance matrix of $\mathbf{A}$ is $\boldsymbol{\Sigma}$, we say $\mathbf{A}^*$ "whitens" $\mathbf{A}$ because
\begin{equation*}
\text{Cov}(\mathbf{A}^*) = \text{Cov}(\boldsymbol{\Sigma}^{-1/2}\mathbf{A}) = \boldsymbol{\Sigma}^{-1/2}\text{Cov}(\mathbf{A})\boldsymbol{\Sigma}^{-1/2} = \boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma} \boldsymbol{\Sigma}^{-1/2} = (\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{1/2})(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{-1/2}) = \mathbf{I}.
\end{equation*}
Later we discuss how to obtain $\boldsymbol{\Sigma}^{1/2}$.

Additional notation is used in the predict() section:
\begin{equation*}
\begin{split}
\mathbf{y}_o & = \text{Observed response vector} \\
\mathbf{y}_u & = \text{Unobserved response vector} \\
\mathbf{X}_o & = \text{Design matrix of known explanatory variables at observed response variable locations} \\
\mathbf{X}_u & = \text{Design matrix of known explanatory variables at unobserved response variable locations} \\
\boldsymbol{\Sigma}_o & = \text{Covariance matrix of $\mathbf{y}_o$ evaluated at } \boldsymbol{\theta} \\
\boldsymbol{\Sigma}_u & = \text{Covariance matrix of $\mathbf{y}_u$ evaluated at } \boldsymbol{\theta} \\
\boldsymbol{\Sigma}_{uo} & = \text{A matrix of covariances between $\mathbf{y}_u$ and $\mathbf{y}_o$ evaluated at } \boldsymbol{\theta} \\
\mathbf{w}_o & = \text{Latent $\mathbf{w}$ for each observation in $\mathbf{y}_o$} \\
\mathbf{w}_u & = \text{Latent $\mathbf{w}$ for each observation in $\mathbf{y}_u$} \\
\mathbf{G}_o & = \text{Hessian for $\mathbf{w}_o$}
\end{split}
\end{equation*}

Spatial Linear Models

Statistical linear models are often parameterized as
\begin{equation}\label{eq:lm}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},
\end{equation}
where for a sample size $n$, $\mathbf{y}$ is an $n \times 1$ column vector of response variables, $\mathbf{X}$ is an $n \times p$ design (model) matrix of explanatory variables, $\boldsymbol{\beta}$ is a $p \times 1$ column vector of fixed effects controlling the impact of $\mathbf{X}$ on $\mathbf{y}$, and $\boldsymbol{\epsilon}$ is an $n \times 1$ column vector of random errors. We typically assume that $\text{E}(\boldsymbol{\epsilon}) = \mathbf{0}$ and $\text{Cov}(\boldsymbol{\epsilon}) = \sigma^2_\epsilon \mathbf{I}$, where $\text{E}(\cdot)$ denotes expectation, $\text{Cov}(\cdot)$ denotes covariance, $\sigma^2_\epsilon$ denotes a variance parameter, and $\mathbf{I}$ denotes the identity matrix.

The model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$ assumes the elements of $\mathbf{y}$ are uncorrelated. Typically for spatial data, elements of $\mathbf{y}$ are correlated, as observations close together in space tend to be more similar than observations far apart (Tobler 1970). Failing to properly accommodate the spatial dependence in $\mathbf{y}$ can cause researchers to draw incorrect conclusions about their data. To accommodate spatial dependence in $\mathbf{y}$, an $n \times 1$ spatial random effect, $\boldsymbol{\tau}$, is added to the linear model, yielding the model
\begin{equation}\label{eq:splm}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon},
\end{equation}
where $\boldsymbol{\tau}$ is independent of $\boldsymbol{\epsilon}$, $\text{E}(\boldsymbol{\tau}) = \mathbf{0}$, $\text{Cov}(\boldsymbol{\tau}) = \sigma^2_\tau \mathbf{R}$, and $\mathbf{R}$ is a matrix that determines the spatial dependence structure in $\mathbf{y}$ and depends on a range parameter, $\phi$. We discuss $\mathbf{R}$ in more detail shortly. The parameter $\sigma^2_\tau$ is called the spatially dependent random error variance or partial sill. The parameter $\sigma^2_\epsilon$ is called the spatially independent random error variance or nugget. These two variance parameters are henceforth more intuitively written as $\sigma^2_{de}$ and $\sigma^2_{ie}$, respectively. The covariance of $\mathbf{y}$ is denoted $\boldsymbol{\Sigma}$ and given by $\sigma^2_{de} \mathbf{R} + \sigma^2_{ie} \mathbf{I}$. The parameters that compose this covariance are contained in the vector $\boldsymbol{\theta}$, which is called the covariance parameter vector.

The model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}$ is called the spatial linear model. The spatial linear model applies to both point-referenced and areal (i.e., lattice) data. Spatial data are point-referenced when the elements in $\mathbf{y}$ are observed at point-locations indexed by x-coordinates and y-coordinates on a spatially continuous surface with an infinite number of locations. The splm() function is used to fit spatial linear models for point-referenced data (these are sometimes called geostatistical models). One spatial covariance function available in splm() is the exponential spatial covariance function, which has an $\mathbf{R}$ matrix given by
\begin{equation*}
\mathbf{R} = \exp(-\mathbf{M} / \phi),
\end{equation*}
where $\mathbf{M}$ is a matrix of Euclidean distances among observations. Recall that $\phi$ is the range parameter, controlling the behavior of the covariance function as a function of distance. Parameterizations for splm() spatial covariance types and their $\mathbf{R}$ matrices can be seen by running help("splm", "spmodel") or vignette("technical", "spmodel"). Some of these spatial covariance types (e.g., Matérn) depend on an extra parameter beyond $\sigma^2_{de}$, $\sigma^2_{ie}$, and $\phi$.
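
To make this concrete, the following base R sketch builds the exponential covariance matrix $\boldsymbol{\Sigma} = \sigma^2_{de} \exp(-\mathbf{M} / \phi) + \sigma^2_{ie} \mathbf{I}$ for a handful of point locations; the coordinates and parameter values are arbitrary illustrations, not spmodel defaults or estimates.

# hypothetical coordinates and covariance parameters (illustration only)
coords <- cbind(x = c(0, 1, 2, 4), y = c(0, 1, 0, 3))
sigma2_de <- 2    # partial sill
sigma2_ie <- 0.5  # nugget
phi <- 3          # range

M <- as.matrix(dist(coords))    # Euclidean distance matrix
R <- exp(-M / phi)              # exponential correlation matrix
Sigma <- sigma2_de * R + sigma2_ie * diag(nrow(coords))
Sigma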

Spatial data are areal when the elements in $\mathbf{y}$ are observed as part of a finite network of polygons whose connections are indexed by a neighborhood structure. For example, the polygons may represent counties in a state that are neighbors if they share at least one boundary. Areal data are often equivalently called lattice data (Cressie 1993). The spautor() function is used to fit spatial linear models for areal data (these are sometimes called spatial autoregressive models). One spatial autoregressive covariance function available in spautor() is the simultaneous autoregressive spatial covariance function, which has an $\mathbf{R}$ matrix given by
\begin{equation*}
\mathbf{R} = [(\mathbf{I} - \phi \mathbf{W})(\mathbf{I} - \phi \mathbf{W})^\top]^{-1},
\end{equation*}
where $\mathbf{W}$ is a weight matrix describing the neighborhood structure in $\mathbf{y}$. Parameterizations for spautor() spatial covariance types and their $\mathbf{R}$ matrices can be seen by running help("spautor", "spmodel") or vignette("technical", "spmodel").

One way to define $\mathbf{W}$ is through queen contiguity (Anselin, Syabri, and Kho 2010). Two observations are queen contiguous if they share a boundary. The $ij$th element of $\mathbf{W}$ is then one if observation $i$ and observation $j$ are queen contiguous and zero otherwise. Observations are not considered neighbors with themselves, so each diagonal element of $\mathbf{W}$ is zero.

Sometimes each element in the weight matrix $\mathbf{W}$ is divided by its respective row sum. This is called row-standardization. Row-standardizing $\mathbf{W}$ has several benefits, which are discussed in detail by Ver Hoef et al. (2018).
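
As a small illustration of queen-contiguity-style weights and row-standardization, the base R sketch below builds a binary weight matrix for a hypothetical four-polygon lattice and divides each row by its row sum; the adjacency pattern is made up for demonstration.

# hypothetical binary weight matrix: 1 = neighbors, 0 = not; zero diagonal
W <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

W_rs <- W / rowSums(W)  # row-standardize
rowSums(W_rs)           # every row now sums to one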

AIC() and AICc()

The AIC() and AICc() functions in spmodel are defined for restricted maximum likelihood and maximum likelihood estimation, which maximize a likelihood. The AIC and AICc as defined by Hoeting et al. (2006) are given by
\begin{equation*}\label{eq:sp_aic}
\begin{split}
\text{AIC} & = -2\ell(\hat{\boldsymbol{\Theta}}) + 2(|\hat{\boldsymbol{\Theta}}|) \\
\text{AICc} & = -2\ell(\hat{\boldsymbol{\Theta}}) + 2n(|\hat{\boldsymbol{\Theta}}|) / (n - |\hat{\boldsymbol{\Theta}}| - 1),
\end{split}
\end{equation*}
where $|\hat{\boldsymbol{\Theta}}|$ is the cardinality of $\hat{\boldsymbol{\Theta}}$. For restricted maximum likelihood, $\hat{\boldsymbol{\Theta}} \equiv \{\hat{\boldsymbol{\theta}}\}$. For maximum likelihood, $\hat{\boldsymbol{\Theta}} \equiv \{\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{\beta}}\}$. The discrepancy arises because restricted maximum likelihood integrates the fixed effects out of the likelihood, and so the likelihood does not depend on $\boldsymbol{\beta}$.

AIC comparisons between a model fit using restricted maximum likelihood and a model fit using maximum likelihood are meaningless, as the models are fit with different likelihoods. AIC comparisons between models fit using restricted maximum likelihood are only valid when the models have the same fixed effect structure. In contrast, AIC comparisons between models fit using maximum likelihood are valid when the models have different fixed effect structures.
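
The sketch below shows a comparison that is valid under these rules, assuming the moss data set shipped with spmodel: both fits use maximum likelihood, so their AIC and AICc values can be compared even though the fixed effect structures differ.

library(spmodel)

ml1 <- splm(log_Zn ~ log_dist2road, data = moss,
            spcov_type = "exponential", estmethod = "ml")
ml2 <- splm(log_Zn ~ 1, data = moss,
            spcov_type = "exponential", estmethod = "ml")

AIC(ml1)
AIC(ml2)
AICc(ml1)
AICc(ml2)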

anova()

Test statistics from anova() are formed using the general linear hypothesis test. Let $\mathbf{L}$ be an $l \times p$ contrast matrix and $l_0$ be an $l \times 1$ vector. The null hypothesis is that $\mathbf{L} \boldsymbol{\hat{\beta}} = l_0$ and the alternative hypothesis is that $\mathbf{L} \boldsymbol{\hat{\beta}} \neq l_0$. Usually, $l_0$ is the zero vector (in spmodel, this is assumed). The test statistic is denoted $Chi2$ and is given by
\begin{equation*}\label{eq:glht}
Chi2 = (\mathbf{L} \boldsymbol{\hat{\beta}} - l_0)^\top(\mathbf{L} (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1} \mathbf{L}^\top)^{-1}(\mathbf{L} \boldsymbol{\hat{\beta}} - l_0).
\end{equation*}
By default, $\mathbf{L}$ is chosen such that each variable in the data used to fit the model is tested marginally (i.e., controlling for the other variables) against $l_0 = \mathbf{0}$. If this default is not desired, user-defined $\mathbf{L}$ matrices can be passed to anova(); they must be constructed in such a way that $l_0 = \mathbf{0}$.

It is notoriously difficult to determine appropriate p-values for linear mixed models based on the general linear hypothesis test. lme4, for example, does not report p-values by default. A few reasons why obtaining p-values is so challenging:

  • The first (and often most important) challenge is that when estimating $\boldsymbol{\theta}$ using a finite sample, it is usually not clear what the null distribution of $Chi2$ is. In certain cases such as ordinary least squares regression or some experimental designs (e.g., blocked design, split plot design, etc.), $Chi2 / \text{rank}(\mathbf{L})$ is F-distributed with known numerator and denominator degrees of freedom. But outside of these well-studied cases, no general results exist.
  • The second challenge is that the standard error of $Chi2$ does not account for the uncertainty in $\boldsymbol{\hat{\theta}}$. For some approaches to addressing this problem, see Kackar and Harville (1984), Prasad and Rao (1990), Harville and Jeske (1992), and Kenward and Roger (1997).
  • The third challenge is in determining denominator degrees of freedom. Again, in some cases, these are known – but this is not true in general. For some approaches to addressing this problem, see Satterthwaite (1946), Schluchter and Elashoff (1990), Hrong-Tai Fai and Cornelius (1996), Kenward and Roger (1997), Littell et al. (2006), Pinheiro and Bates (2006), and Kenward and Roger (2009).

For these reasons, spmodel uses an asymptotic (i.e., large sample) Chi-squared test when calculating p-values using anova(). This approach addresses the three points above by assuming that with a large enough sample size:

  • $Chi2$ is asymptotically Chi-squared (under certain conditions) with $\text{rank}(\mathbf{L})$ degrees of freedom when the null hypothesis is true.
  • The uncertainty from estimating $\boldsymbol{\hat{\theta}}$ is small enough to be safely ignored.

Because the approximation is asymptotic, degree of freedom adjustments can be ignored (it is also worth noting that an F distribution with infinite denominator degrees of freedom is a Chi-squared distribution scaled by $\text{rank}(\mathbf{L})$). This asymptotic approximation implies these p-values are likely unreliable with small samples.

Note that when comparing full and reduced models, the general linear hypothesis test is analogous to an extra sum of (whitened) squares approach (Myers et al. 2012).

A second approach to determining p-values is a likelihood ratio test. Let $\ell(\boldsymbol{\hat{\Theta}})$ be the log-likelihood for some full model and $\ell(\boldsymbol{\hat{\Theta}}_0)$ be the log-likelihood for some reduced model. For the likelihood ratio test to be valid, the reduced model must be nested in the full model, which means that $\ell(\boldsymbol{\hat{\Theta}}_0)$ is obtained by fixing some parameters in $\boldsymbol{\Theta}$. When the likelihood ratio test is valid, $X^2 = 2\ell(\boldsymbol{\hat{\Theta}}) - 2\ell(\boldsymbol{\hat{\Theta}}_0)$ is asymptotically Chi-squared with degrees of freedom equal to the difference in estimated parameters between the full and reduced models.

For restricted maximum likelihood estimation, likelihood ratio tests can only be used to compare nested models with the same explanatory variables. To use likelihood ratio tests for comparing different explanatory variable structures, parameters must be estimated using maximum likelihood estimation. When using likelihood ratio tests to assess the importance of parameters on the boundary of a parameter space (e.g., a variance parameter being zero), p-values tend to be too large (Self and Liang 1987; Stram and Lee 1994; Goldman and Whelan 2000; Pinheiro and Bates 2006).
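
A brief sketch of both testing approaches follows, reusing the maximum likelihood fits from the AIC section (moss data); the two-model anova() call is shown under the assumption that anova() accepts a pair of nested maximum likelihood fits for a likelihood ratio comparison.

# asymptotic Chi-squared (general linear hypothesis) tests of each fixed effect
anova(ml1)

# likelihood ratio test comparing the nested maximum likelihood fits
anova(ml1, ml2)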

BIC()

The BIC() function in spmodel is defined for restricted maximum likelihood and maximum likelihood estimation, which maximize a likelihood. The BIC as defined by Schwarz (1978) is given by
\begin{equation*}
\text{BIC} = -2\ell(\hat{\boldsymbol{\Theta}}) + \ln(n)(|\hat{\boldsymbol{\Theta}}|),
\end{equation*}
where $n$ is the sample size and $|\hat{\boldsymbol{\Theta}}|$ is the cardinality of $\hat{\boldsymbol{\Theta}}$. For restricted maximum likelihood, $\hat{\boldsymbol{\Theta}} \equiv \{\hat{\boldsymbol{\theta}}\}$. For maximum likelihood, $\hat{\boldsymbol{\Theta}} \equiv \{\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{\beta}}\}$. The discrepancy arises because restricted maximum likelihood integrates the fixed effects out of the likelihood, and so the likelihood does not depend on $\boldsymbol{\beta}$.

BIC comparisons between a model fit using restricted maximum likelihood and a model fit using maximum likelihood are meaningless, as the models are fit with different likelihoods. BIC comparisons between models fit using restricted maximum likelihood are only valid when the models have the same fixed effect structure. In contrast, BIC comparisons between models fit using maximum likelihood are valid when the models have different fixed effect structures. While BIC was derived by Schwarz (1978) for independent data, Zimmerman and Ver Hoef (2024) show it can be useful for spatially-dependent data as well.

coef()

coef() returns relevant coefficients based on the type argument. When type = "fixed" (the default), coef() returns
\begin{equation*}
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{y}.
\end{equation*}
If the estimation method is restricted maximum likelihood or maximum likelihood, $\hat{\boldsymbol{\beta}}$ is known as the restricted maximum likelihood or maximum likelihood estimator of $\boldsymbol{\beta}$. If the estimation method is semivariogram weighted least squares or semivariogram composite likelihood, $\hat{\boldsymbol{\beta}}$ is known as the empirical generalized least squares estimator of $\boldsymbol{\beta}$. When type = "spcov", the estimated spatial covariance parameters are returned (available for all estimation methods). When type = "randcov", the estimated random effect variance parameters are returned (available for restricted maximum likelihood and maximum likelihood estimation).

confint()

confint() returns confidence intervals for estimated parameters. Currently, confint() only returns confidence intervals for $\boldsymbol{\beta}$. The $100(1 - \alpha)$% confidence interval for $\beta_i$ is
\begin{equation*}
\hat{\beta}_i \pm z^* \sqrt{(\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}_{i, i}},
\end{equation*}
where $(\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}_{i, i}$ is the $i$th diagonal element of $(\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}$, $\Phi(z^*) = 1 - \alpha / 2$, $\Phi(\cdot)$ is the standard normal (Gaussian) cumulative distribution function, and $\alpha = 1 -$ level, where level is an argument to confint(). The default for level is 0.95, which corresponds to a $z^*$ of approximately 1.96.
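
The sketch below compares confint() to an interval computed directly from the formula above, assuming the moss data, a default REML fit, and that vcov() returns $(\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}$ for the fitted model.

spmod <- splm(log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential")

confint(spmod, level = 0.95)

# reconstruct the intervals by hand
beta_hat <- coef(spmod)
se_beta  <- sqrt(diag(vcov(spmod)))
z_star   <- qnorm(1 - 0.05 / 2)
cbind(lower = beta_hat - z_star * se_beta,
      upper = beta_hat + z_star * se_beta)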

cooks.distance()

Cook’s distance measures the influence of an observation (Cook 1979; Cook and Weisberg 1982). An influential observation has a large impact on the model fit. The vector of Cook’s distances for the spatial linear model is given by
\begin{equation}\label{eq:cooksd}
\frac{\mathbf{e}_p^2}{p} \odot \text{diag}(\mathbf{H}_s) \odot \frac{1}{1 - \text{diag}(\mathbf{H}_s)},
\end{equation}
where $\mathbf{e}_p$ are the Pearson residuals, $\text{diag}(\mathbf{H}_s)$ is the diagonal of the spatial hat matrix, $\mathbf{H}_s \equiv \mathbf{X}^* (\mathbf{X}^{* \top} \mathbf{X}^*)^{-1} \mathbf{X}^{* \top}$ (Montgomery, Peck, and Vining 2021), and $\odot$ denotes the Hadamard (element-wise) product. The larger the Cook’s distance, the larger the influence.

To better understand the previous form, recall that the non-spatial linear model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$ assumes elements of $\boldsymbol{\epsilon}$ are independent and identically distributed (iid) with constant variance. In this context the vector of non-spatial Cook’s distances is given by
\begin{equation*}
\frac{\mathbf{e}_p^2}{p} \odot \text{diag}(\mathbf{H}) \odot \frac{1}{1 - \text{diag}(\mathbf{H})},
\end{equation*}
where $\text{diag}(\mathbf{H})$ is the diagonal of the non-spatial hat matrix, $\mathbf{H} \equiv \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}$. When the elements of $\boldsymbol{\epsilon}$ are not iid or do not have constant variance or both, the spatial Cook’s distance cannot be calculated using $\mathbf{H}$. First the linear model must be whitened according to $\mathbf{y}^* = \mathbf{X}^* \boldsymbol{\beta} + \boldsymbol{\epsilon}^*$, where $\boldsymbol{\epsilon}^*$ is the whitened version of the sum of all random errors in the model. Then the spatial Cook’s distance follows using $\mathbf{X}^*$, the whitened version of $\mathbf{X}$.
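
To illustrate the formula, this sketch recomputes the spatial Cook's distances from the Pearson residuals and hat values of the spmod fit above and compares them to cooks.distance(); the two versions should agree up to numerical precision if cooks.distance() follows the formula as described.

e_p <- residuals(spmod, type = "pearson")
h   <- hatvalues(spmod)
p   <- length(coef(spmod))

cd_manual <- (e_p^2 / p) * h / (1 - h)
all.equal(unname(cd_manual), unname(cooks.distance(spmod)))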

deviance()

The deviance of a fitted model is
\begin{equation*}
\mathcal{D}_{\boldsymbol{\Theta}} = 2\ell(\boldsymbol{\Theta}_s) - 2\ell(\boldsymbol{\hat{\Theta}}),
\end{equation*}
where $\ell(\boldsymbol{\Theta}_s)$ is the log-likelihood of a "saturated" model that fits every observation perfectly. For normal (Gaussian) random errors,
\begin{equation*}
\mathcal{D}_{\boldsymbol{\Theta}} = (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^\top \hat{\boldsymbol{\Sigma}}^{-1} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}).
\end{equation*}

fitted()

Fitted values can be obtained for the response, spatial random errors, and random effects. The fitted values for the response (type = "response"), denoted $\mathbf{\hat{y}}$, are given by
\begin{equation*}\label{eq:fit_resp}
\mathbf{\hat{y}} = \mathbf{X} \boldsymbol{\hat{\beta}}.
\end{equation*}
They are the estimated mean response given the set of explanatory variables for each observation.

Fitted values for spatial random errors (type = "spcov") and random effects (type = "randcov") are linked to best linear unbiased predictors from linear mixed model theory. Consider the standard random effects parameterization
\begin{equation*}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{u} + \boldsymbol{\epsilon},
\end{equation*}
where $\mathbf{Z}$ denotes the random effects design matrix, $\mathbf{u}$ denotes the random effects, and $\boldsymbol{\epsilon}$ denotes independent random error. Henderson (1975) states that the best linear unbiased predictor (BLUP) of a single random effect vector $\mathbf{u}$, denoted $\mathbf{\hat{u}}$, is given by
\begin{equation}\label{eq:blup_mm}
\mathbf{\hat{u}} = \sigma^2_u \mathbf{Z}^\top \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}),
\end{equation}
where $\sigma^2_u$ is the variance of $\mathbf{u}$.

Searle, Casella, and McCulloch (2009) generalize this idea by showing that for a random vector $\boldsymbol{\alpha}$ in a linear model, the best linear unbiased predictor (based on the response, $\mathbf{y}$) of $\boldsymbol{\alpha}$, denoted $\boldsymbol{\hat{\alpha}}$, is given by
\begin{equation}\label{eq:blup_gen}
\boldsymbol{\hat{\alpha}} = \text{E}(\boldsymbol{\alpha}) + \boldsymbol{\Sigma}_\alpha \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}),
\end{equation}
where $\boldsymbol{\Sigma}_\alpha = \text{Cov}(\boldsymbol{\alpha}, \mathbf{y})$. Evaluating this equation at the plug-in (empirical) estimates of the covariance parameters yields the empirical best linear unbiased predictor (EBLUP) of $\boldsymbol{\alpha}$.

Recall that the spatial linear model with random effects is
\begin{equation*}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{u} + \boldsymbol{\tau} + \boldsymbol{\epsilon}.
\end{equation*}
Building from previous results, we can find BLUPs for each random term in the spatial linear model ($\mathbf{u}$, $\boldsymbol{\tau}$, and $\boldsymbol{\epsilon}$). For example, the BLUP of $\mathbf{u}$ is found by noting that $\text{E}(\mathbf{u}) = \mathbf{0}$ and
\begin{equation*}
\boldsymbol{\Sigma}_u = \text{Cov}(\mathbf{u}, \mathbf{y}) = \text{Cov}(\mathbf{u}, \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{u} + \boldsymbol{\tau} + \boldsymbol{\epsilon}) = \text{Cov}(\mathbf{u}, \mathbf{Z}\mathbf{u}) = \text{Cov}(\mathbf{u}, \mathbf{u})\mathbf{Z}^\top = \sigma^2_u \mathbf{Z}^\top,
\end{equation*}
where the result follows because the random terms in $\mathbf{y}$ are independent and $\text{Cov}(\mathbf{u}, \mathbf{u}) = \sigma^2_u \mathbf{I}$. Then it follows that
\begin{equation*}
\hat{\mathbf{u}} = \text{E}(\mathbf{u}) + \boldsymbol{\Sigma}_u \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}) = \sigma^2_u \mathbf{Z}^\top \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}),
\end{equation*}
which matches the previous form of the BLUP. Similarly, the BLUP of $\boldsymbol{\tau}$ is found by noting that $\text{E}(\boldsymbol{\tau}) = \mathbf{0}$ and
\begin{equation*}
\boldsymbol{\Sigma}_{de} = \text{Cov}(\boldsymbol{\tau}, \mathbf{y}) = \text{Cov}(\boldsymbol{\tau}, \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{u} + \boldsymbol{\tau} + \boldsymbol{\epsilon}) = \text{Cov}(\boldsymbol{\tau}, \boldsymbol{\tau}) = \sigma^2_{de} \mathbf{R},
\end{equation*}
where the result follows because the random terms in $\mathbf{y}$ are independent, $\text{Cov}(\boldsymbol{\tau}, \boldsymbol{\tau}) = \sigma^2_{de} \mathbf{R}$, and $\sigma^2_{de}$ is the variance of $\boldsymbol{\tau}$. Then it follows that
\begin{equation}\label{eq:blup_sp}
\hat{\boldsymbol{\tau}} = \text{E}(\boldsymbol{\tau}) + \boldsymbol{\Sigma}_{de} \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}) = \sigma^2_{de} \mathbf{R} \boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}).
\end{equation}
Fitted values for $\boldsymbol{\epsilon}$ are obtained using similar arguments. Evaluating these equations at the plug-in (empirical) estimates of the covariance parameters yields EBLUPs.
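
A short sketch of the corresponding fitted() types, assuming the moss data contain a sample column suitable for a non-spatial random effect; the random effect is included purely so that type = "randcov" has something to return.

rand_mod <- splm(log_Zn ~ log_dist2road, data = moss,
                 spcov_type = "exponential", random = ~ sample)

fit_resp <- fitted(rand_mod, type = "response")  # X %*% beta-hat
fit_sp   <- fitted(rand_mod, type = "spcov")     # EBLUPs of the spatial random errors
fit_re   <- fitted(rand_mod, type = "randcov")   # EBLUPs of the random effects
str(fit_sp)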

When partition factors are used, the covariance matrix of all random effects (spatial and non-spatial) can be viewed as the interaction between the non-partitioned covariance matrix and the partition matrix, $\mathbf{P}$. The $ij$th entry in $\mathbf{P}$ equals one if observation $i$ and observation $j$ share the same level of the partition factor and zero otherwise. For spatial random effects, an adjustment is straightforward, as each column in $\boldsymbol{\Sigma}_{de}$ corresponds to a distinct spatial random effect. Thus with partition factors, $\boldsymbol{\Sigma}_{de}^* = \boldsymbol{\Sigma}_{de} \odot \mathbf{P} = \sigma^2_{de} \mathbf{R} \odot \mathbf{P}$, where $\odot$ denotes the Hadamard (element-wise) product, is used instead of $\boldsymbol{\Sigma}_{de}$. Note that $\boldsymbol{\Sigma}_{ie}$ is unchanged as it is proportional to the identity matrix. For non-spatial random effects, however, the situation is more complicated. Applying the BLUP formula directly yields BLUPs of random effects corresponding to the interaction between random effect levels and partition levels. Thus a logical approach is to average the non-zero BLUPs for each random effect level across partition levels, yielding a prediction for the random effect level. This does not imply, however, that these estimates are BLUPs of the random effect.

For big data without partition factors, the local indexes act as partition factors. That is, the BLUPs correspond to random effects interacted with each local index. For big data with partition factors, an adjusted partition factor is created as the interaction between each local index and the partition factor. Then this adjusted partition factor is applied to yield 𝛂̂\hat{\boldsymbol{\alpha}}.

hatvalues()

Hat values measure the leverage of an observation. An observation has high leverage if its combination of explanatory variables is atypical (far from the mean explanatory vector). The spatial leverage (hat) matrix is given by
\begin{equation}\label{eq:leverage}
\mathbf{H}_s = \mathbf{X}^* (\mathbf{X}^{* \top} \mathbf{X}^*)^{-1} \mathbf{X}^{* \top}.
\end{equation}
The diagonal of this matrix yields the leverage (hat) values for each observation (Montgomery, Peck, and Vining 2021). The larger the hat value, the larger the leverage.

To better understand the previous form, recall that the non-spatial linear model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$ assumes elements of $\boldsymbol{\epsilon}$ are independent and identically distributed (iid) with constant variance. In this context, the leverage (hat) matrix is given by
\begin{equation*}
\mathbf{H} \equiv \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}.
\end{equation*}
When the elements of $\boldsymbol{\epsilon}$ are not iid or do not have constant variance or both, the spatial leverage (hat) matrix is not $\mathbf{H}$. First the linear model must be whitened according to $\mathbf{y}^* = \mathbf{X}^* \boldsymbol{\beta} + \boldsymbol{\epsilon}^*$, where $\boldsymbol{\epsilon}^*$ is the whitened version of the sum of all random errors in the model. Then the spatial leverage (hat) matrix follows using $\mathbf{X}^*$, the whitened version of $\mathbf{X}$.

logLik()

The log-likelihood is given by $\ell(\boldsymbol{\hat{\Theta}})$.

loocv()

$k$-fold cross validation is a useful tool for evaluating model fits using "hold-out" data. The data are split into $k$ sets. One-by-one, one of the $k$ sets is held out, the model is fit to the remaining $k - 1$ sets, and predictions at each observation in the hold-out set are compared to their true values. The closer the predictions are to the true observations, the better the model fit. A special case where $k = n$ is known as leave-one-out cross validation (loocv), as each observation is left out one-by-one. Computationally efficient solutions exist for leave-one-out cross validation in the non-spatial linear model (with iid, constant variance errors). Outside of this case, however, fitting $n$ separate models can be computationally infeasible. loocv() makes a compromise that balances an approximation to the true solution with computational feasibility. First $\boldsymbol{\theta}$ is estimated using all of the data. Then for each of the $n$ model fits, loocv() does not re-estimate $\boldsymbol{\theta}$ but does re-estimate $\boldsymbol{\beta}$. This approach relies on the assumption that the covariance parameter estimates obtained using $n - 1$ observations are approximately the same as the covariance parameter estimates obtained using all $n$ observations. For a large enough sample size, this is a reasonable assumption.

First define $\boldsymbol{\Sigma}_{-i, -i}$ as $\boldsymbol{\Sigma}$ with the $i$th row and column deleted, $\boldsymbol{\Sigma}_{i, -i}$ as the $i$th row of $\boldsymbol{\Sigma}$ with the $i$th column deleted, $\boldsymbol{\Sigma}_{i, i}$ as the $i$th diagonal element of $\boldsymbol{\Sigma}$, $\mathbf{X}_{-i}$ as $\mathbf{X}$ with the $i$th row deleted, $\mathbf{X}_{i}$ as the $i$th row of $\mathbf{X}$, $\mathbf{y}_{-i}$ as $\mathbf{y}$ with the $i$th element deleted, and $\mathbf{y}_i$ as the $i$th element of $\mathbf{y}$. Wolf (1978) shows that given $\boldsymbol{\Sigma}^{-1}$, a computationally efficient form for $\boldsymbol{\Sigma}^{-1}_{-i, -i}$ exists. First observe that $\boldsymbol{\Sigma}^{-1}$ can be represented blockwise as
\begin{equation*}
\boldsymbol{\Sigma}^{-1} = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}}_{-i, -i} & \tilde{\boldsymbol{\Sigma}}_{i,-i}^\top \\ \tilde{\boldsymbol{\Sigma}}_{i,-i} & \tilde{\boldsymbol{\Sigma}}_{i, i} \end{bmatrix},
\end{equation*}
where the dimensions of each $\tilde{\boldsymbol{\Sigma}}$ match the respective dimensions of relevant blocks in $\boldsymbol{\Sigma}$. Then it follows that
\begin{equation*}
\boldsymbol{\Sigma}^{-1}_{-i, -i} = \tilde{\boldsymbol{\Sigma}}_{-i, -i} - \tilde{\boldsymbol{\Sigma}}_{i,-i}^\top \tilde{\boldsymbol{\Sigma}}_{i, i}^{-1}\tilde{\boldsymbol{\Sigma}}_{i,-i}
\end{equation*}
and
\begin{equation*}
\boldsymbol{\beta}_{-i} = (\mathbf{X}^\top_{-i} \boldsymbol{\Sigma}^{-1}_{-i, -i} \mathbf{X}_{-i})^{-1} \mathbf{X}^\top_{-i} \boldsymbol{\Sigma}^{-1}_{-i, -i} \mathbf{y}_{-i},
\end{equation*}
where $\boldsymbol{\beta}_{-i}$ is the estimate of $\boldsymbol{\beta}$ constructed without the $i$th observation.

The loocv prediction of $y_i$ is then given by
\begin{equation*}
\hat{y}_i = \mathbf{X}_i \hat{\boldsymbol{\beta}}_{-i} + \hat{\boldsymbol{\Sigma}}_{i, -i}\hat{\boldsymbol{\Sigma}}^{-1}_{-i, -i}(\mathbf{y}_{-i} - \mathbf{X}_{-i} \hat{\boldsymbol{\beta}}_{-i})
\end{equation*}
and the prediction variance of the loocv prediction of $y_i$ is given by
\begin{equation*}
\dot{\sigma}^2_i = \hat{\boldsymbol{\Sigma}}_{i, i} - \hat{\boldsymbol{\Sigma}}_{i, -i} \hat{\boldsymbol{\Sigma}}^{-1}_{-i, -i} \hat{\boldsymbol{\Sigma}}_{i, -i}^\top + \mathbf{Q}_i(\mathbf{X}_{-i}^\top \hat{\boldsymbol{\Sigma}}_{-i, -i}^{-1} \mathbf{X}_{-i})^{-1}\mathbf{Q}_i^\top,
\end{equation*}
where $\mathbf{Q}_i = \mathbf{X}_i - \hat{\boldsymbol{\Sigma}}_{i, -i} \hat{\boldsymbol{\Sigma}}^{-1}_{-i, -i} \mathbf{X}_{-i}$. These formulas are analogous to the formulas used to obtain best linear unbiased predictions of unobserved data and their prediction variances. Model fits are evaluated using several statistics: bias, mean-squared-prediction error (MSPE), root-mean-squared-prediction error (RMSPE), and the squared correlation (cor2) between the observed data and leave-one-out predictions (regarded as a prediction version of r-squared appropriate for comparing across spatial and nonspatial models).

Bias is formally defined as
\begin{equation*}
\text{bias} = \frac{1}{n}\sum_{i = 1}^n(y_i - \hat{y}_i).
\end{equation*}

MSPE is formally defined as
\begin{equation*}
\text{MSPE} = \frac{1}{n}\sum_{i = 1}^n(y_i - \hat{y}_i)^2.
\end{equation*}

RMSPE is formally defined as
\begin{equation*}
\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i = 1}^n(y_i - \hat{y}_i)^2}.
\end{equation*}

cor2 is formally defined as
\begin{equation*}
\text{cor2} = \text{Cor}(\mathbf{y}, \hat{\mathbf{y}})^2,
\end{equation*}
where $\text{Cor}(\cdot)$ is the correlation function. cor2 is only returned for spatial linear models, as it is not applicable for spatial generalized linear models (we are predicting a latent mean parameter, which is unknown and not on the same scale as the original data).

Generally, bias should be near zero for well-fitting models. The lower the MSPE and RMSPE, the better the model fit. The higher the cor2, the better the model fit.
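
A minimal sketch using the spmod fit from earlier; loocv() is assumed to return the summary statistics described above. Fitting a non-spatial counterpart (spcov_type = "none") gives a baseline for comparing RMSPE.

loocv(spmod)

# non-spatial baseline for comparison
lmod <- splm(log_Zn ~ log_dist2road, data = moss, spcov_type = "none")
loocv(lmod)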

Big Data

Options for big data leave-one-out cross validation rely on the local argument, which is passed to predict(). The local list for predict() is explained in detail in the predict() section, but we provide a short summary of how local interacts with loocv() here.

For splm() and spautor() objects, the local method can be "all". When the local method is "all", all of the data are used for leave-one-out cross validation (i.e., it is implemented exactly as previously described). Parallelization is implemented when setting parallel = TRUE in local, and the number of cores to use for parallelization is specified via ncores.

For splm() objects, additional options for the local method are "covariance" and "distance". When the local method is "covariance", then a number of observations (specified via the size argument) having the highest covariance with the held-out observation are used in the local neighborhood prediction approach. When the local method is "distance", then a number of observations (specified via the size argument) closest to the held-out observation are used in the local neighborhood prediction approach. When no random effects are used, no partition factor is used, and the spatial covariance function is monotone decreasing, "covariance" and "distance" are equivalent. The local neighborhood approach only uses the observations in the local neighborhood of the held-out observation to perform prediction, and is thus an approximation to the true solution. Its computational efficiency derives from using $\boldsymbol{\Sigma}_{l, l}$ (the covariance matrix of the observations in the local neighborhood) instead of $\boldsymbol{\Sigma}$ (the covariance matrix of all the observations). Parallelization is implemented when setting parallel = TRUE in local, and the number of cores to use for parallelization is specified via ncores.
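
A sketch of these big data options, again using spmod; method, size, parallel, and ncores are the local list elements described above, and the specific values are illustrative.

# local neighborhood approximation: 50 nearest observations per held-out point
loocv(spmod, local = list(method = "distance", size = 50))

# exact leave-one-out, parallelized across 2 cores
loocv(spmod, local = list(method = "all", parallel = TRUE, ncores = 2))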

predict()

interval = "none"

The empirical best linear unbiased predictions (i.e., empirical Kriging predictions) of $\mathbf{y}_u$ are given by
\begin{equation}\label{eq:blup}
\mathbf{\dot{y}}_u = \mathbf{X}_u \hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_{o} (\mathbf{y}_o - \mathbf{X}_o \hat{\boldsymbol{\beta}}).
\end{equation}

This equation is sometimes called an empirical universal Kriging predictor, a Kriging with external drift predictor, or a regression Kriging predictor.

The covariance matrix of $\mathbf{\dot{y}}_u$ is given by
\begin{equation}\label{eq:blup_cov}
\dot{\boldsymbol{\Sigma}}_u = \hat{\boldsymbol{\Sigma}}_u - \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_o \hat{\boldsymbol{\Sigma}}^\top_{uo} + \mathbf{Q}(\mathbf{X}_o^\top \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1}\mathbf{Q}^\top,
\end{equation}
where $\mathbf{Q} = \mathbf{X}_u - \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_o \mathbf{X}_o$.

When se.fit = TRUE, standard errors are returned by taking the square root of the diagonal of 𝚺̇u\dot{\boldsymbol{\Sigma}}_u.

interval = "prediction"

The empirical best linear unbiased predictions are returned as $\mathbf{\dot{y}}_u$. The (100 $\times$ level)% prediction interval for $(y_u)_i$ is $(\dot{y}_u)_i \pm z^* \sqrt{(\dot{\boldsymbol{\Sigma}}_u)_{i, i}}$, where $\sqrt{(\dot{\boldsymbol{\Sigma}}_u)_{i, i}}$ is the standard error of $(\dot{y}_u)_i$ obtained from se.fit = TRUE, $\Phi(z^*) = 1 - \alpha / 2$, $\Phi(\cdot)$ is the standard normal (Gaussian) cumulative distribution function, $\alpha = 1 -$ level, and level is an argument to predict(). The default for level is 0.95, which corresponds to a $z^*$ of approximately 1.96.

interval = "confidence"

The best linear unbiased estimates of $\text{E}[(y_u)_i]$ ($\text{E}(\cdot)$ denotes expectation) are returned by evaluating $(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}}$, where $(\mathbf{X}_u)_i$ is the $i$th row of $\mathbf{X}_u$ (i.e., fitted values corresponding to $(\mathbf{X}_u)_i$ are returned). The (100 $\times$ level)% confidence interval for $\text{E}[(y_u)_i]$ is $(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} \pm z^* \sqrt{(\mathbf{X}_u)_i (\mathbf{X}^\top_o \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1} (\mathbf{X}_u)_i^\top}$, where $\sqrt{(\mathbf{X}_u)_i (\mathbf{X}^\top_o \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1} (\mathbf{X}_u)_i^\top}$ is the standard error of the estimate obtained from se.fit = TRUE, $\Phi(z^*) = 1 - \alpha / 2$, $\Phi(\cdot)$ is the standard normal (Gaussian) cumulative distribution function, $\alpha = 1 -$ level, and level is an argument to predict(). The default for level is 0.95, which corresponds to a $z^*$ of approximately 1.96.
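
A sketch of the three interval options, assuming the sulfate (observed) and sulfate_preds (unobserved) point data sets shipped with spmodel.

sulfmod <- splm(sulfate ~ 1, data = sulfate, spcov_type = "exponential")

# predictions and standard errors
predict(sulfmod, newdata = sulfate_preds, se.fit = TRUE)

# 95% prediction intervals for the unobserved responses
predict(sulfmod, newdata = sulfate_preds, interval = "prediction", level = 0.95)

# 95% confidence intervals for the means at the unobserved locations
predict(sulfmod, newdata = sulfate_preds, interval = "confidence", level = 0.95)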

spautor() extra steps

For spatial autoregressive models, an extra step is required to obtain $\hat{\boldsymbol{\Sigma}}^{-1}_o$, $\hat{\boldsymbol{\Sigma}}_u$, and $\hat{\boldsymbol{\Sigma}}_{uo}$, as they depend on one another through the neighborhood structure of $\mathbf{y}_o$ and $\mathbf{y}_u$. Recall that for autoregressive models, it is $\boldsymbol{\Sigma}^{-1}$ that is straightforward to obtain, not $\boldsymbol{\Sigma}$.

Let $\boldsymbol{\Sigma}^{-1}$ be the inverse covariance matrix of the observed and unobserved data, $\mathbf{y}_o$ and $\mathbf{y}_u$. One approach to obtaining $\boldsymbol{\Sigma}_o$ and $\boldsymbol{\Sigma}_{uo}$ is to directly invert $\boldsymbol{\Sigma}^{-1}$ and then subset $\boldsymbol{\Sigma}$ appropriately. This inversion can be prohibitive when $n_o + n_u$ is large. A faster way to obtain $\boldsymbol{\Sigma}_o$ and $\boldsymbol{\Sigma}_{uo}$ exists. Represent $\boldsymbol{\Sigma}^{-1}$ blockwise as
\begin{equation*}\label{eq:auto_hw}
\boldsymbol{\Sigma}^{-1} = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}}_{o} & \tilde{\boldsymbol{\Sigma}}^{\top}_{uo} \\ \tilde{\boldsymbol{\Sigma}}_{uo} & \tilde{\boldsymbol{\Sigma}}_{u} \end{bmatrix},
\end{equation*}
where the dimensions of the blocks match the relevant dimensions of $\boldsymbol{\Sigma}$. All of the terms required for prediction can be obtained from this block representation. Wolf (1978) shows that
\begin{equation*}\label{eq:hw_forms}
\begin{split}
\boldsymbol{\Sigma}^{-1}_o & = \tilde{\boldsymbol{\Sigma}}_{o} - \tilde{\boldsymbol{\Sigma}}^{\top}_{uo} (\tilde{\boldsymbol{\Sigma}}_{u})^{-1} \tilde{\boldsymbol{\Sigma}}_{uo} \\
\boldsymbol{\Sigma}_u & = (\tilde{\boldsymbol{\Sigma}}_{u} - \tilde{\boldsymbol{\Sigma}}_{uo} (\tilde{\boldsymbol{\Sigma}}_{o})^{-1} \tilde{\boldsymbol{\Sigma}}^\top_{uo})^{-1} \\
\boldsymbol{\Sigma}_{uo} & = - \boldsymbol{\Sigma}_u \tilde{\boldsymbol{\Sigma}}_{uo} \tilde{\boldsymbol{\Sigma}}^{-1}_{o}.
\end{split}
\end{equation*}
Evaluating these expressions at $\hat{\boldsymbol{\theta}}$ yields $\hat{\boldsymbol{\Sigma}}^{-1}_o$, $\hat{\boldsymbol{\Sigma}}_u$, and $\hat{\boldsymbol{\Sigma}}_{uo}$.

A similar result exists for the log determinant of 𝚺o\boldsymbol{\Sigma}_o, which is not required for prediction but is required for restricted maximum likelihood and maximum likelihood estimation.

Big Data

When the number of observations in the fitted model (observed data) is large or there are many locations to predict or both, it is often necessary to implement computationally efficient big data approximations. Big data approximations are implemented in spmodel using the local argument to predict(). When the local method is "all", all of the fitted model data are used to make predictions. In this context, computational efficiency is only gained by parallelizing each prediction. The only available local method for spautor() fitted models is "all". This is because the neighborhood structure of spautor() fitted models does not permit the subsetting used by the "covariance" and "distance" methods that we discuss next.

When the local method is "covariance", $\hat{\boldsymbol{\Sigma}}_{uo}$ is computed between the observation being predicted ($\mathbf{y}_u$) and the rest of the observed data. This vector is then ordered and a number of observations (specified via the size argument) having the highest covariance with $\mathbf{y}_u$ are subset, yielding $\check{\boldsymbol{\Sigma}}_{uo}$, which has dimension $1 \times \text{size}$. Then similarly $\hat{\boldsymbol{\Sigma}}_o$, $\mathbf{y}_o$, and $\mathbf{X}_u$ are also subset by these size observations, yielding $\check{\boldsymbol{\Sigma}}_{o}$, $\check{\mathbf{y}}_o$, and $\check{\mathbf{X}}_u$, respectively. The previous prediction equations can be evaluated at $\check{\boldsymbol{\Sigma}}_{uo}$, $\check{\boldsymbol{\Sigma}}_{o}$, $\check{\mathbf{y}}_o$, and $\check{\mathbf{X}}_u$ (except for the quantity $(\mathbf{X}_o^\top \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1}$, which is evaluated using all the observed data) to yield predictions and standard errors. When the local method is "distance", a similar approach is used except a number of observations (specified via the size argument) closest (in terms of Euclidean distance) to $\mathbf{y}_u$ are subset instead. When random effects are not used, partition factors are not used, and the spatial covariance function is monotone decreasing, "covariance" and "distance" are equivalent. This approach of subsetting the observed data by the set of locations closest in covariance or proximity to $\mathbf{y}_u$ is known as the local neighborhood approach. As long as size is relatively small (the default is 100), the local neighborhood approach is very computationally efficient, mainly because $\check{\boldsymbol{\Sigma}}_{o}^{-1}$ is easy to compute. Additional computational efficiency is gained by parallelizing each prediction.
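
A sketch of the local prediction options, reusing the sulfmod fit above; the size and ncores values are illustrative.

# local neighborhood Kriging using the 50 observations closest in covariance
predict(sulfmod, newdata = sulfate_preds,
        local = list(method = "covariance", size = 50))

# all observed data, with predictions parallelized across 2 cores
predict(sulfmod, newdata = sulfate_preds,
        local = list(method = "all", parallel = TRUE, ncores = 2))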

splmRF() and spautorRF()

Random forest spatial residual model predictions are obtained by combining random forest predictions and spatial linear model predictions (i.e., Kriging) of the random forest residuals. Formally, the random forest spatial residual model predictions of $\mathbf{y}_u$ are given by
\begin{equation*}
\mathbf{\dot{y}}_u = \mathbf{\dot{y}}_{u, rf} + \mathbf{\dot{e}}_{u, slm},
\end{equation*}
where $\mathbf{\dot{y}}_{u, rf}$ are the random forest predictions for $\mathbf{y}_u$ and $\mathbf{\dot{e}}_{u, slm}$ are the spatial linear model predictions of the random forest residuals for $\mathbf{y}_u$. This process of obtaining predictions is sometimes analogously called random forest regression Kriging (Fox, Ver Hoef, and Olsen 2020).

Uncertainty quantification in a random forest context has been studied (Meinshausen and Ridgeway 2006) but is not currently available in spmodel. Big data are accommodated by supplying the local argument to predict().
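
A minimal sketch of the workflow, assuming the moss data, that splmRF() accepts splm()-style covariance arguments such as spcov_type, and that the ranger package (used for the random forest component) is installed; newpoints is a hypothetical sf object of prediction locations containing the model covariates.

rfmod <- splmRF(log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential")

# predictions combine the random forest prediction with the Kriged residual:
# predict(rfmod, newdata = newpoints)  # newpoints is hypothetical prediction data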

pseudoR2()

The pseudo R-squared is a generalization of the classical R-squared from non-spatial linear models. Like the classical R-squared, the pseudo R-squared measures the proportion of variability in the response explained by the fixed effects in the fitted model. Unlike the classical R-squared, the pseudo R-squared can be applied to models whose errors do not satisfy the iid and constant variance assumption. The pseudo R-squared is given by
\begin{equation*}
PR2 = 1 - \frac{\mathcal{D}(\boldsymbol{\hat{\Theta}})}{\mathcal{D}(\boldsymbol{\hat{\Theta}}_0)},
\end{equation*}
where $\mathcal{D}(\boldsymbol{\hat{\Theta}}_0)$ is the deviance of a reduced model containing only an intercept. For normal (Gaussian) random errors, the pseudo R-squared is
\begin{equation*}
PR2 = 1 - \frac{(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^\top \hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{(\mathbf{y} - \hat{\mu})^\top \hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{y} - \hat{\mu})},
\end{equation*}
where $\hat{\mu} = (\boldsymbol{1}^\top \hat{\boldsymbol{\Sigma}}^{-1} \boldsymbol{1})^{-1} \boldsymbol{1}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{y}$. For the non-spatial model, the pseudo R-squared reduces to the classical R-squared, as
\begin{equation*}
PR2 = 1 - \frac{(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^\top \hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{(\mathbf{y} - \hat{\mu})^\top \hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{y} - \hat{\mu})} = 1 - \frac{(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{(\mathbf{y} - \hat{\mu})^\top (\mathbf{y} - \hat{\mu})} = 1 - \frac{\text{SSE}}{\text{SST}} = R2,
\end{equation*}
where SSE denotes the error sum of squares and SST denotes the total sum of squares. The result follows because for a non-spatial model, $\boldsymbol{\Sigma}$ is proportional to the identity matrix.

The adjusted pseudo R-squared adjusts for additional explanatory variables and is given by
\begin{equation*}
PR2adj = 1 - (1 - PR2)\frac{n - 1}{n - p}.
\end{equation*}
If the fitted model does not have an intercept, the $n - 1$ term is instead $n$.
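
A short sketch using the spmod fit from earlier. For the non-spatial fit (spcov_type = "none") created in the loocv() section, the pseudo R-squared should agree with the classical R-squared from lm(), up to numerical precision.

pseudoR2(spmod)

# for the non-spatial fit, the pseudo R-squared reduces to the classical R-squared
pseudoR2(lmod)
summary(lm(log_Zn ~ log_dist2road, data = moss))$r.squared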

residuals()

Terminology regarding residual names is often conflicting and confusing. Because of this, we explicitly define the residual options we use in spmodel. These definitions may be different from others you have seen in the literature.

When type = "response", response residuals are returned:
\begin{equation*}
\mathbf{e}_{r} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}.
\end{equation*}

When type = "pearson", Pearson residuals are returned:
\begin{equation*}
\mathbf{e}_{p} = \hat{\boldsymbol{\Sigma}}^{-1/2}\mathbf{e}_{r}.
\end{equation*}
If the errors are normal (Gaussian), the Pearson residuals should be approximately normally distributed with mean zero and variance one. The result follows when $\hat{\boldsymbol{\Sigma}}^{-1/2} \approx \boldsymbol{\Sigma}^{-1/2}$ because
\begin{equation*}
\text{E}(\boldsymbol{\Sigma}^{-1/2} \mathbf{e}_{r}) = \boldsymbol{\Sigma}^{-1/2} \text{E}(\mathbf{e}_{r}) = \boldsymbol{\Sigma}^{-1/2} \boldsymbol{0} = \boldsymbol{0}
\end{equation*}
and
\begin{equation*}
\begin{split}
\text{Cov}(\boldsymbol{\Sigma}^{-1/2} \mathbf{e}_{r}) & = \boldsymbol{\Sigma}^{-1/2} \text{Cov}(\mathbf{e}_{r}) \boldsymbol{\Sigma}^{-1/2} \\
& \approx \boldsymbol{\Sigma}^{-1/2} \boldsymbol{\Sigma} \boldsymbol{\Sigma}^{-1/2} \\
& = (\boldsymbol{\Sigma}^{-1/2} \boldsymbol{\Sigma}^{1/2})(\boldsymbol{\Sigma}^{1/2} \boldsymbol{\Sigma}^{-1/2}) \\
& = \mathbf{I}.
\end{split}
\end{equation*}

When type = "standardized", standardized residuals are returned:
\begin{equation*}
\mathbf{e}_{s} = \mathbf{e}_{p} \odot \frac{1}{\sqrt{1 - \text{diag}(\mathbf{H}^*)}},
\end{equation*}
where $\text{diag}(\mathbf{H}^*)$ is the diagonal of the spatial hat matrix, $\mathbf{H}^* \equiv \mathbf{X}^* (\mathbf{X}^{* \top} \mathbf{X}^*)^{-1} \mathbf{X}^{* \top}$ (the matrix $\mathbf{H}_s$ from before), and $\odot$ denotes the Hadamard (element-wise) product. This residual transformation "standardizes" the Pearson residuals. As such, the standardized residuals should also have mean zero and approximate covariance
\begin{equation*}
\begin{split}
\text{Cov}(\mathbf{e}_{s}) & = \text{Cov}((\mathbf{I} - \mathbf{H}^*) \hat{\boldsymbol{\Sigma}}^{-1/2}\mathbf{y}) \\
& \approx \text{Cov}((\mathbf{I} - \mathbf{H}^*) \boldsymbol{\Sigma}^{-1/2}\mathbf{y}) \\
& = (\mathbf{I} - \mathbf{H}^*) \boldsymbol{\Sigma}^{-1/2} \text{Cov}(\mathbf{y}) \boldsymbol{\Sigma}^{-1/2}(\mathbf{I} - \mathbf{H}^*)^\top \\
& = (\mathbf{I} - \mathbf{H}^*) \boldsymbol{\Sigma}^{-1/2} \boldsymbol{\Sigma} \boldsymbol{\Sigma}^{-1/2}(\mathbf{I} - \mathbf{H}^*)^\top \\
& = (\mathbf{I} - \mathbf{H}^*) \mathbf{I} (\mathbf{I} - \mathbf{H}^*)^\top \\
& = (\mathbf{I} - \mathbf{H}^*),
\end{split}
\end{equation*}
because $(\mathbf{I} - \mathbf{H}^*)$ is symmetric and idempotent. Note that the average value of $\text{diag}(\mathbf{H}^*)$ is $p / n$, so $(\mathbf{I} - \mathbf{H}^*) \approx \mathbf{I}$ for large sample sizes.
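
A brief sketch extracting each residual type from the spmod fit used earlier, with a quick check that the standardized residuals have roughly mean zero and variance one.

e_r <- residuals(spmod, type = "response")
e_p <- residuals(spmod, type = "pearson")
e_s <- residuals(spmod, type = "standardized")

mean(e_s)  # roughly zero
var(e_s)   # roughly one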

spautor() and splm()

Next we discuss technical details for the spautor() and splm() functions. Many of the details for the two functions are the same, though occasional differences are noted in the following subsection headers. Specifically, spautor() and splm() are for different data types and use different covariance functions. spautor() is for spatial linear models with areal data (i.e., spatial autoregressive models) and splm() is for spatial linear models with point-referenced data (i.e., geostatistical models). There are also a few features splm() has that spautor() does not: semivariogram-based estimation, random effects, anisotropy, and big data approximations.

spautor() Spatial Covariance Functions

For areal data, the covariance matrix depends on the specification of a neighborhood structure among the observations. Observations with at least one neighbor (not counting themselves) are called "connected" observations. Observations with no neighbors are called "unconnected" observations. The autoregressive spatial covariance matrix can be defined as
\begin{equation*}
\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2_{de} \mathbf{R} & \mathbf{0} \\ \mathbf{0} & \sigma^2_{\xi} \mathbf{I} \end{bmatrix} + \sigma^2_{ie} \mathbf{I},
\end{equation*}
where $\sigma^2_{de}$ ($\geq 0$) is the spatially dependent (correlated) variance for the connected observations, $\mathbf{R}$ is a matrix that describes the spatial dependence for the connected observations, $\sigma^2_{\xi}$ ($\geq 0$) is the independent (not correlated) variance for the unconnected observations, and $\sigma^2_{ie}$ ($\geq 0$) is the independent (not correlated) variance for all observations. As seen, the connected and unconnected observations are allowed different variances. The total variance for connected observations is then $\sigma^2_{de} + \sigma^2_{ie}$ and the total variance for unconnected observations is $\sigma^2_{\xi} + \sigma^2_{ie}$. spmodel accommodates two spatial covariances: conditional autoregressive (CAR) and simultaneous autoregressive (SAR), both of which have their $\mathbf{R}$ forms provided in the following table.

The forms of R for each spatial covariance type available in spautor()
Spatial covariance type   $\mathbf{R}$ functional form
"car"   $(\mathbf{I} - \phi\mathbf{W})^{-1}\mathbf{M}$
"sar"   $[(\mathbf{I} - \phi\mathbf{W})(\mathbf{I} - \phi\mathbf{W})^\top]^{-1}$

For both CAR and SAR covariance functions, $\mathbf{R}$ depends on similar quantities: $\mathbf{I}$, an identity matrix; $\phi$, a range parameter; and $\mathbf{W}$, a matrix that defines the neighborhood structure. Often $\mathbf{W}$ is symmetric, but it need not be. Valid values for $\phi$ are in $(1 / \lambda_{min}, 1 / \lambda_{max})$, where $\lambda_{min}$ is the minimum eigenvalue of $\mathbf{W}$ and $\lambda_{max}$ is the maximum eigenvalue of $\mathbf{W}$ (Ver Hoef et al. 2018). For SAR covariance functions, $\lambda_{min}$ must be negative and $\lambda_{max}$ must be positive. For CAR covariance functions, a matrix $\mathbf{M}$ must be provided that satisfies the CAR symmetry condition, which enforces the symmetry of the covariance matrix. The CAR symmetry condition states
\begin{equation*}
\frac{\mathbf{W}_{ij}}{\mathbf{M}_{ii}} = \frac{\mathbf{W}_{ji}}{\mathbf{M}_{jj}}
\end{equation*}
for all $i$ and $j$, where $i$ and $j$ index rows or columns. When $\mathbf{W}$ is symmetric, $\mathbf{M}$ is often taken to be the identity matrix.

The default in spmodel is to row-standardize $\mathbf{W}$ by dividing each element by its respective row sum, which decreases variance. If row-standardization is not used for a CAR model, the default in spmodel for $\mathbf{M}$ is the identity matrix.
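
A minimal sketch of fitting both autoregressive covariance types, assuming the seal polygon data set shipped with spmodel; spautor() constructs the (row-standardized) weight matrix internally from the polygon neighborhood structure.

sealmod_car <- spautor(log_trend ~ 1, data = seal, spcov_type = "car")
sealmod_sar <- spautor(log_trend ~ 1, data = seal, spcov_type = "sar")

coef(sealmod_car, type = "spcov")  # estimated variance and range parameters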

splm() Spatial Covariance Functions

For point-referenced data, the spatial covariance is given by σde2𝐑+σie2𝐈,\begin{equation*} \sigma^2_{de}\mathbf{R} + \sigma^2_{ie} \mathbf{I}, \end{equation*} where σde2\sigma^2_{de}(0)(\geq 0) is the spatially dependent (correlated) variance, 𝐑\mathbf{R} is a spatial correlation matrix, σie2\sigma^2_{ie}(0)(\geq 0) is the spatially independent (not correlated) variance, and 𝐈\mathbf{I} is an identity matrix. The 𝐑\mathbf{R} matrix always depends on a range parameter, ϕ\phi(>0)(> 0), that controls the behavior of the covariance function with distance. For some covariance functions, the 𝐑\mathbf{R} matrix depends on an additional parameter that we call the “extra” parameter. The following table shows the parametric form for all 𝐑\mathbf{R} matrices available in splm(). The range parameter is denoted as ϕ\phi, the distance is denoted as hh, the distance divided by the range parameter (h/ϕh / \phi) is denoted as η\eta, {}\mathcal{I}\{\cdot\} is an indicator function equal to one when the argument occurs and zero otherwise, and the extra parameter is denoted as ξ\xi (when relevant).

The forms of R for each spatial covariance type available in splm(). All spatial covariance functions are valid in two dimensions except the triangular and cosine functions, which are only valid in one dimension.
Spatial Covariance Type R Functional Form
"𝚎𝚡𝚙𝚘𝚗𝚎𝚗𝚝𝚒𝚊𝚕"\texttt{"exponential"} eηe^{-\eta}
"𝚜𝚙𝚑𝚎𝚛𝚒𝚌𝚊𝚕"\texttt{"spherical"} (11.5η+0.5η3){hϕ}(1 - 1.5\eta + 0.5\eta^3)\mathcal{I}\{h \leq \phi \}
"𝚐𝚊𝚞𝚜𝚜𝚒𝚊𝚗"\texttt{"gaussian"} eη2e^{-\eta^2}
"𝚝𝚛𝚒𝚊𝚗𝚐𝚞𝚕𝚊𝚛"\texttt{"triangular"} (1η){hϕ}(1 - \eta)\mathcal{I}\{h \leq \phi \}
"𝚌𝚒𝚛𝚌𝚞𝚕𝚊𝚛"\texttt{"circular"} (12π[m1m2+sin1{m}]){hϕ},m=min(η,1)(1 - \frac{2}{\pi}[m\sqrt{1 - m^2} + sin^{-1}\{m\}])\mathcal{I}\{h \leq \phi \}, m = min(\eta, 1)
"𝚌𝚞𝚋𝚒𝚌"\texttt{"cubic"} (17η2+8.75η33.5η5+0.75η7){hϕ}(1 - 7\eta^2 + 8.75\eta^3 - 3.5\eta^5 + 0.75 \eta^7)\mathcal{I}\{h \leq \phi \} \
"𝚙𝚎𝚗𝚝𝚊𝚜𝚙𝚑𝚎𝚛𝚒𝚌𝚊𝚕"\texttt{"pentaspherical"} (11.875η+1.250η30.375η5){hϕ}(1 - 1.875\eta + 1.250\eta^3 - 0.375\eta^5)\mathcal{I}\{h \leq \phi \} \
"𝚌𝚘𝚜𝚒𝚗𝚎"\texttt{"cosine"} $ cos() $
"𝚠𝚊𝚟𝚎"\texttt{"wave"} sin(η)η{h>0}+{h=0}\frac{sin(\eta)}{\eta}\mathcal{I}\{h > 0 \} + \mathcal{I}\{h = 0 \}
"𝚓𝚋𝚎𝚜𝚜𝚎𝚕"\texttt{"jbessel"} Bj(hϕ),BjB_j(h\phi), B_j is Bessel-J
"𝚐𝚛𝚊𝚟𝚒𝚝𝚢"\texttt{"gravity"} (1+η2)1/2(1 + \eta^2)^{-1/2}
"𝚛𝚚𝚞𝚊𝚍"\texttt{"rquad"} (1+η2)1(1 + \eta^2)^{-1}
"𝚖𝚊𝚐𝚗𝚎𝚝𝚒𝚌"\texttt{"magnetic"} (1+η2)3/2(1 + \eta^2)^{-3/2}
"𝚖𝚊𝚝𝚎𝚛𝚗"\texttt{"matern"} 2(1ξ)Γ(ξ)αξBk(α,ξ),α=2ξη,Bk\frac{2^{(1 - \xi)}}{\Gamma(\xi)} \alpha^\xi B_k(\alpha, \xi), \alpha = \sqrt{2\xi \eta}, B_k is Bessel-K with order ξ\xi, ξ[1/5,5]\xi \in [1/5, 5]
"𝚌𝚊𝚞𝚌𝚑𝚢"\texttt{"cauchy"} (1+η2)ξ(1 + \eta^2)^{-\xi}, ξ>0\xi > 0
"𝚙𝚎𝚡𝚙𝚘𝚗𝚎𝚗𝚝𝚒𝚊𝚕"\texttt{"pexponential"} exp(hξ/ϕ)exp(-h^\xi / \phi), ξ(0,2]\xi \in (0, 2]
"𝚗𝚘𝚗𝚎"\texttt{"none"} 00

Model-fitting

Likelihood-based Estimation (estmethod = "reml" or estmethod = "ml")

Minus twice a profiled (by 𝛃\boldsymbol{\beta}) Gaussian log-likelihood is given by 2p(𝛉)=ln|𝚺|+(𝐲𝐗𝛃̃)𝚺1(𝐲𝐗𝛃̃)+nln2π,\begin{equation}\label{eq:ml-lik} -2\ell_p(\boldsymbol{\theta}) = \ln{|\boldsymbol{\Sigma}|} + (\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}}) + n \ln{2\pi}, \end{equation} where 𝛃̃=(𝐗𝚺1𝐗)1𝐗𝚺1𝐲\tilde{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{\Sigma}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Sigma}^{-1} \mathbf{y}. Minimizing this equation yields 𝛉̂ml\boldsymbol{\hat{\theta}}_{ml}, the maximum likelihood estimates for 𝛉\boldsymbol{\theta}. Then a closed-form solution exists for 𝛃̂ml\boldsymbol{\hat{\beta}}_{ml}, the maximum likelihood estimates for 𝛃\boldsymbol{\beta}: 𝛃̂ml=𝛃̃ml\boldsymbol{\hat{\beta}}_{ml} = \tilde{\boldsymbol{\beta}}_{ml}, where 𝛃̃ml\tilde{\boldsymbol{\beta}}_{ml} is 𝛃̃\tilde{\boldsymbol{\beta}} evaluated at 𝛉̂ml\boldsymbol{\hat{\theta}}_{ml}. To reduce bias in the variances of 𝛃̂ml\boldsymbol{\hat{\beta}}_{ml} that can occur due to the simultaneous estimation of 𝛃\boldsymbol{\beta} and 𝛉\boldsymbol{\theta}, restricted maximum likelihood estimation (REML) (Patterson and Thompson 1971; Harville 1977; Wolfinger, Tobias, and Sall 1994) has been shown to be better than maximum likelihood estimation. Integrating 𝛃\boldsymbol{\beta} out of a Gaussian likelihood yields the restricted Gaussian likelihood. Minus twice a restricted Gaussian log-likelihood is given by 2R(𝛉)=2p(𝛉)+ln|𝐗𝚺1𝐗|pln2π,\begin{equation}\label{eq:reml-lik} -2\ell_R(\boldsymbol{\theta}) = -2\ell_p(\boldsymbol{\theta}) + \ln{|\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X}|} - p \ln{2\pi} , \end{equation} where pp equals the dimension of 𝛃\boldsymbol{\beta}. Minimizing this equation yields 𝛉̂reml\boldsymbol{\hat{\theta}}_{reml}, the restricted maximum likelihood estimates for 𝛉\boldsymbol{\theta}. Then a closed-form solution exists for 𝛃̂reml\boldsymbol{\hat{\beta}}_{reml}, the restricted maximum likelihood estimates for 𝛃\boldsymbol{\beta}: 𝛃̂reml=𝛃̃reml\boldsymbol{\hat{\beta}}_{reml} = \tilde{\boldsymbol{\beta}}_{reml}, where 𝛃̃reml\tilde{\boldsymbol{\beta}}_{reml} is 𝛃̃\tilde{\boldsymbol{\beta}} evaluated at 𝛉̂reml\boldsymbol{\hat{\theta}}_{reml}.

The covariance matrix can often be written as 𝚺=σ2𝚺*\boldsymbol{\Sigma} = \sigma^2 \boldsymbol{\Sigma}^*, where σ2\sigma^2 is the overall variance and 𝚺*\boldsymbol{\Sigma}^* is a covariance matrix that depends on parameter vector 𝛉*\boldsymbol{\theta}^* with one less dimension than 𝛉\boldsymbol{\theta}. Then the overall variance, σ2\sigma^2, can be profiled out of the previous likelihood equation. This reduces the number of parameters requiring optimization by one, which can dramatically reduce estimation time. Further profiling out σ2\sigma^2 yields 2p*(𝛉*)=ln|𝚺*|+nln[(𝐲𝐗𝛃̃)𝚺*1(𝐲𝐗𝛃̃)]+n+nln2π/n.\begin{equation*}\label{eq:ml-plik} -2\ell_p^*(\boldsymbol{\theta}^*) = \ln{|\boldsymbol{\Sigma^*}|} + n\ln[(\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})^\top \boldsymbol{\Sigma}^{* -1} (\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})] + n + n\ln{2\pi / n}. \end{equation*} After finding 𝛉̂ml*\hat{\boldsymbol{\theta}}^*_{ml}, a closed form solution for σ̂ml2\hat{\sigma}^2_{ml} exists: σ̂ml2=[(𝐲𝐗𝛃̃)𝚺*1(𝐲𝐗𝛃̃)]/n\hat{\sigma}^2_{ml} = [(\mathbf{y} - \mathbf{X} \boldsymbol{\tilde{\beta}})^\top \mathbf{\Sigma}^{* -1} (\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})] / n. Then 𝛉̂ml*\boldsymbol{\hat{\theta}}^*_{ml} is combined with σ̂ml2\hat{\sigma}^2_{ml} to yield 𝛉̂ml\boldsymbol{\hat{\theta}}_{ml} and subsequently 𝛃̂ml\boldsymbol{\hat{\beta}}_{ml}. A similar result holds for restricted maximum likelihood estimation. Further profiling out σ2\sigma^2 yields 2R*(𝚯)=ln|𝚺*|+(np)ln[(𝐲𝐗𝛃̃)𝚺*1(𝐲𝐗𝛃̃)]+ln|𝐗𝚺*1𝐗|+(np)+(np)ln2π/(np).\begin{equation*}\label{eq:reml-plik} -2\ell_R^*(\boldsymbol{\Theta}) = \ln{|\boldsymbol{\Sigma}^*|} + (n - p)\ln[(\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})^\top \boldsymbol{\Sigma}^{* -1} (\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})] + \ln{|\mathbf{X}^\top \boldsymbol{\Sigma}^{* -1} \mathbf{X}|} + (n - p) + (n - p)\ln2\pi / (n - p). \end{equation*} After finding 𝛉̂reml*\hat{\boldsymbol{\theta}}^*_{reml}, a closed form solution for σ̂reml2\hat{\sigma}^2_{reml} exists: σ̂reml2=[(𝐲𝐗𝛃̃)𝚺*1(𝐲𝐗𝛃̃)]/(np)\hat{\sigma}^2_{reml} = [(\mathbf{y} - \mathbf{X} \boldsymbol{\tilde{\beta}})^\top \mathbf{\Sigma}^{* -1} (\mathbf{y} - \mathbf{X} \tilde{\boldsymbol{\beta}})] / (n - p). Then 𝛉̂reml*\boldsymbol{\hat{\theta}}^*_{reml} is combined with σ̂reml2\hat{\sigma}^2_{reml} to yield 𝛉̂reml\boldsymbol{\hat{\theta}}_{reml} and subsequently 𝛃̂reml\boldsymbol{\hat{\beta}}_{reml}. For more on profiling Gaussian likelihoods, see Wolfinger, Tobias, and Sall (1994).
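To make the profiling concrete, the sketch below evaluates minus twice the profiled REML log-likelihood for a toy exponential correlation matrix using plain matrix algebra on simulated data. It is a pedagogical sketch, not spmodel's internal implementation, and the names (neg2_reml_profiled, D) are our own.

# minus twice the profiled REML log-likelihood (sigma2 and beta profiled out)
neg2_reml_profiled <- function(phi, y, X, D) {
  n <- length(y); p <- ncol(X)
  R <- exp(-D / phi)                                 # Sigma* for an exponential model
  Ri <- solve(R)
  beta_tilde <- solve(t(X) %*% Ri %*% X, t(X) %*% Ri %*% y)
  r <- y - X %*% beta_tilde
  rss <- as.numeric(t(r) %*% Ri %*% r)
  as.numeric(determinant(R, logarithm = TRUE)$modulus) + (n - p) * log(rss) +
    as.numeric(determinant(t(X) %*% Ri %*% X, logarithm = TRUE)$modulus) +
    (n - p) + (n - p) * log(2 * pi / (n - p))
}

set.seed(1)
coords <- cbind(runif(30), runif(30))
D <- as.matrix(dist(coords))                         # pairwise distances
X <- cbind(1, runif(30))
y <- rnorm(30)
optimize(neg2_reml_profiled, interval = c(0.01, 2), y = y, X = X, D = D)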

Both maximum likelihood and restricted maximum likelihood estimation rely on the inverse of the n×nn \times n covariance matrix. Inverting an n×nn \times n matrix is computationally demanding, with a cost that scales cubically with the sample size. For this reason, maximum likelihood and restricted maximum likelihood estimation have historically been infeasible to implement in their standard form with data larger than a few thousand observations. This motivates the use of big data approaches.

Semivariogram-based Estimation (splm() only)

An alternative approach to likelihood-based estimation is semivariogram-based estimation. The semivariogram of a constant-mean process 𝐲\mathbf{y} is the expectation of half of the squared difference between two observations hh distance apart. More formally, the semivariogram is denoted γ(h)\gamma(h) and defined as γ(h)=E[(yiyj)2]/2,\begin{equation*}\label{eq:sv} \gamma(h) = \text{E}[(y_i - y_j)^2] / 2 , \end{equation*} where hh is the Euclidean distance between the locations of yiy_i and yjy_j. When the process 𝐲\mathbf{y} is second-order stationary, the semivariogram and covariance function are intimately connected: γ(h)=σ2Cov(h)\gamma(h) = \sigma^2 - \text{Cov}(h), where σ2\sigma^2 is the overall variance and Cov(h)\text{Cov}(h) is the covariance function evaluated at hh. As such, the semivariogram and covariance function rely on the same parameter vector 𝛉\boldsymbol{\theta}. Both of the semivariogram approaches described next are more computationally efficient than restricted maximum likelihood and maximum likelihood estimation because the major computational burden of the semivariogram approaches (calculations based on squared differences among pairs) scales quadratically with the sample size (i.e., not the cubed sample size like the likelihood-based approaches).

Weighted Least Squares (estmethod = "sv-wls")

The empirical semivariogram is a moment-based estimate of the semivariogram denoted by γ̂(h)\hat{\gamma}(h). It is defined as γ̂(h)=12|N(h)|N(h)(yiyj)2,\begin{equation*} \hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{N(h)} (y_i - y_j)^2, \end{equation*} where N(h)N(h) is the set of observations in 𝐲\mathbf{y} that are hh distance units apart (distance classes) and |N(h)||N(h)| is the cardinality of N(h)N(h)(Cressie 1993). One criticism of the empirical semivariogram is that distance bins and cutoffs tend to be arbitrarily chosen (i.e., not chosen according to some statistical criteria).

Cressie (1985) proposed estimating 𝛉\boldsymbol{\theta} by minimizing an objective function that involves γ(h)\gamma(h) and γ̂(h)\hat{\gamma}(h) and is based on a weighted least squares criterion. This criterion is defined as iwi[γ̂(h)iγ(h)i]2,\begin{equation}\label{eq:svwls} \sum_i w_i [\hat{\gamma}(h)_i - \gamma(h)_i]^2, \end{equation} where wiw_i, γ̂(h)i\hat{\gamma}(h)_i, and γ(h)i\gamma(h)_i are the weights, empirical semivariogram, and semivariogram for the iith distance class, respectively. Minimizing this loss function yields 𝛉̂wls\boldsymbol{\hat{\theta}}_{wls}, the semivariogram weighted least squares estimate of 𝛉\boldsymbol{\theta}. After estimating 𝛉\boldsymbol{\theta}, 𝛃\boldsymbol{\beta} estimates are constructed using (empirical) generalized least squares: 𝛃̂wls=(𝐗𝚺̂1𝐗)1𝐗𝚺̂1𝐲\boldsymbol{\hat{\beta}}_{wls} = (\mathbf{X}^\top \hat{\mathbf{\Sigma}}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\mathbf{\Sigma}}^{-1} \mathbf{y}.

Cressie (1985) recommends setting wi=|N(h)|/γ(h)i2w_i = |N(h)| / \gamma(h)_i^2, which gives more weight to distance classes with more observations (|N(h)||N(h)|) and shorter distances (1/γ(h)i21 / \gamma(h)_i^2). The default in spmodel is to use these wiw_i, known as Cressie weights, though several other options for wiw_i exist and are available via the weights argument. The following table contains all wiw_i available via the weights argument.

Table of values for the weights argument in splm() when estmethod = “sv-wls”.
wiw_i Name wiw_i Form 𝚠𝚎𝚒𝚐𝚑𝚝 =\texttt{weight =}
Cressie |N(h)|/γ(h)i2|N(h)| / \gamma(h)_i^2 "𝚌𝚛𝚎𝚜𝚜𝚒𝚎"\texttt{"cressie"}
Cressie (Denominator) Root |N(h)|/γ(h)i|N(h)| / \gamma(h)_i "𝚌𝚛𝚎𝚜𝚜𝚒𝚎-𝚍𝚛"\texttt{"cressie-dr"}
Cressie No Pairs 1/γ(h)i21 / \gamma(h)_i^2 "𝚌𝚛𝚎𝚜𝚜𝚒𝚎-𝚗𝚘𝚙𝚊𝚒𝚛𝚜"\texttt{"cressie-nopairs"}
Cressie (Denominator) Root No Pairs 1/γ(h)i1 / \gamma(h)_i "𝚌𝚛𝚎𝚜𝚜𝚒𝚎-𝚍𝚛-𝚗𝚘𝚙𝚊𝚒𝚛𝚜"\texttt{"cressie-dr-nopairs"}
Pairs |N(h)||N(h)| "𝚙𝚊𝚒𝚛𝚜"\texttt{"pairs"}
Pairs Inverse Distance |N(h)|/h2|N(h)| / h^2 "𝚙𝚊𝚒𝚛𝚜-𝚒𝚗𝚟𝚍"\texttt{"pairs-invd"}
Pairs Inverse (Root) Distance |N(h)|/h|N(h)| / h "𝚙𝚊𝚒𝚛𝚜-𝚒𝚗𝚟𝚛𝚍"\texttt{"pairs-invrd"}
Ordinary Least Squares 1 "𝚘𝚕𝚜"\texttt{"ols"}

The number of N(h)N(h) classes and the maximum distance for hh are specified by passing the bins and cutoff arguments to splm() (these arguments are passed via ... to esv()). The default value for bins is 15 and the default value for cutoff is half the maximum distance of the spatial domain’s bounding box.

Recall that the semivariogram is defined for a constant-mean process. In general, 𝐲\mathbf{y} does not have a constant mean, so the empirical semivariogram and 𝛉̂wls\boldsymbol{\hat{\theta}}_{wls} are typically constructed using the residuals from an ordinary least squares regression of 𝐲\mathbf{y} on 𝐗\mathbf{X}. These ordinary least squares residuals are assumed to have mean zero.
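As a sketch, the call below requests semivariogram weighted least squares estimation with Cressie weights and a custom number of distance bins; it again assumes the moss dataset and that weights and bins are passed as shown (see the splm() and esv() documentation).

mod_wls <- splm(log_Zn ~ log_dist2road, data = moss,
                spcov_type = "exponential", estmethod = "sv-wls",
                weights = "cressie", bins = 20)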

Composite Likelihood (estmethod = "sv-cl")

Composite likelihood approaches involve constructing likelihoods based on conditional or marginal events for which likelihoods are available and then adding together these individual components. Composite likelihoods are attractive because they behave very similarly to likelihoods but are easier to handle, both from a theoretical and from a computational perspective. Curriero and Lele (1999) derive a particular composite likelihood for estimating semivariogram parameters. The negative log of this composite likelihood, denoted CL(h)\text{CL}(h), is given by CL(h)=i=1n1j>i((yiyj)22γ(h)+ln(γ(h)))\begin{equation}\label{eq:svcl} \text{CL}(h) = \sum_{i = 1}^{n - 1} \sum_{j > i} \left( \frac{(y_i - y_j)^2}{2\gamma(h)} + \ln(\gamma(h)) \right) \end{equation} where γ(h)\gamma(h) is the semivariogram. Minimizing this loss function yields 𝛉̂cl\boldsymbol{\hat{\theta}}_{cl}, the semivariogram composite likelihood estimates of 𝛉\boldsymbol{\theta}. After estimating 𝛉\boldsymbol{\theta}, 𝛃\boldsymbol{\beta} estimates are constructed using (empirical) generalized least squares: 𝛃̂cl=(𝐗𝚺̂1𝐗)1𝐗𝚺̂1𝐲\boldsymbol{\hat{\beta}}_{cl} = (\mathbf{X}^\top \hat{\mathbf{\Sigma}}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\mathbf{\Sigma}}^{-1} \mathbf{y}.
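The sketch below evaluates this composite likelihood objective directly for an exponential semivariogram using ordinary least squares residuals; it is a pedagogical sketch of the loss function (assuming the moss dataset and the sf package), not spmodel's optimized implementation.

# exponential semivariogram: gamma(h) = sigma2_ie + sigma2_de * (1 - exp(-h / phi)) for h > 0
neg_log_cl <- function(pars, e, D) {
  sigma2_de <- pars[1]; sigma2_ie <- pars[2]; phi <- pars[3]
  ut <- upper.tri(D)                        # each pair (i, j) with j > i appears once
  gam <- sigma2_ie + sigma2_de * (1 - exp(-D[ut] / phi))
  diff2 <- (outer(e, e, "-")^2)[ut]         # squared differences for those pairs
  sum(diff2 / (2 * gam) + log(gam))
}

e <- residuals(lm(log_Zn ~ log_dist2road, data = moss))   # OLS residuals
D <- as.matrix(dist(sf::st_coordinates(moss)))
optim(c(1, 0.1, 1), neg_log_cl, e = e, D = D,
      method = "L-BFGS-B", lower = rep(1e-4, 3))$par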

An advantage of the composite likelihood approach to semivariogram estimation is that it does not require arbitrarily specifying empirical semivariogram bins and cutoffs. It does tend to be more computationally demanding than weighted least squares, however. The composite likelihood is constructed from (n2)\binom{n}{2} pairs for a sample size nn, whereas the weighted least squares approach only requires calculating (|N(h)|2)\binom{|N(h)|}{2} pairs for each distance bin N(h)N(h). As with the weighted least squares approach, the composite likelihood approach requires a constant-mean process, so typically the residuals from an ordinary least squares regression of 𝐲\mathbf{y} on 𝐗\mathbf{X} are used to estimate 𝛉\boldsymbol{\theta}.

Optimization

Parameter estimation is performed using stats::optim(). The default optimization method is Nelder-Mead (Nelder and Mead 1965) and the stopping criterion is a relative convergence tolerance (reltol) of 0.0001. If only one parameter requires estimation (on the profiled scale if relevant), the Brent algorithm is used instead (Brent 1971). Arguments to optim() are passed via ... to splm() and spautor(). For example, the default optimization method and convergence criteria can be overridden by passing method and control, respectively, to splm() and spautor(). If the lower and upper arguments intended for optim() are supplied to splm() or spautor(), they are ignored, as optimization for all parameters is generally unconstrained. Initial values for optim() are found using the grid search described next.
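For example, a sketch of overriding the default convergence criteria (assuming the moss dataset) is:

# pass control through ... to stats::optim()
mod <- splm(log_Zn ~ log_dist2road, data = moss, spcov_type = "spherical",
            control = list(reltol = 1e-8, maxit = 1000))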

Grid Search

spmodel uses a grid search to find suitable initial values for use in optimization. For spatial linear models without random effects, the spatially dependent variance (σde2\sigma^2_{de}) and spatially independent variance (σie2\sigma^2_{ie}) parameters are given “low”, “medium”, and “high” values. The sample variance from a non-spatial linear model is slightly inflated by a factor of 1.2 (non-spatial models can underestimate the variance when there is spatial dependence), and these “low”, “medium”, and “high” values correspond to 10%, 50%, and 90% of the inflated sample variance. Only combinations of σde2\sigma^2_{de} and σie2\sigma^2_{ie} whose proportions sum to 100% are considered. The range (ϕ\phi) and extra (ξ\xi) parameters are given “low” and “high” values that are unique to each spatial covariance function. For example, when using an exponential covariance function, the “low” value of ϕ\phi is one-half the diagonal of the domain’s bounding box divided by three. This particular value is chosen so that the effective range (the distance at which the covariance is approximately zero), which equals 3ϕ3\phi for the exponential covariance function, is reached at one-half the diagonal of the domain’s bounding box. Analogously, the “high” value of ϕ\phi is three-halves the diagonal of the domain’s bounding box divided by three. The anisotropy rotation parameter (α\alpha) is given six values that correspond to 0, π/6\pi/6, 2π/62\pi/6, 4π/64\pi/6, 5π/65\pi/6, and π\pi radians. The anisotropy scale parameter (SS) is given “low”, “medium”, and “high” values that correspond to scaling factors of 0.25, 0.75, and 1. Note that the anisotropy parameters are only used during grid searches for point-referenced data.

The crossing of all appropriate parameter values is considered. If initial values are used for a parameter, the initial value replaces all values of the parameter in this crossing. Duplicate crossings are then omitted. The parameter configuration that yields the smallest value of the objective function is then used as an initial value for optimization. Suppose the inflated sample variance is 10, the exponential covariance is used assuming isotropy, and the diagonal of the bounding box is 180 distance units. The parameter configurations evaluated are shown in the following table.

Grid search parameter configurations for an isotropic exponential spatial covariance with inflated sample variance equal to 10 and diagonal of the bounding box equal to 180 distance units.
σde2\sigma^2_{de} σie2\sigma^2_{ie} ϕ\phi α\alpha SS
9 1 15 0 1
1 9 15 0 1
5 5 15 0 1
9 1 45 0 1
1 9 45 0 1
5 5 45 0 1
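A sketch of how such a crossing could be constructed (mirroring the table above, with an assumed inflated sample variance of 10 and bounding-box diagonal of 180) is shown below; this illustrates the described logic and is not spmodel's internal code.

s2 <- 10                                   # inflated sample variance
diag_bb <- 180                             # diagonal of the bounding box
var_props <- data.frame(de = c(0.9, 0.1, 0.5), ie = c(0.1, 0.9, 0.5))
phi_vals <- c(0.5, 1.5) * diag_bb / 3      # "low" and "high" range values

grid <- merge(var_props, data.frame(phi = phi_vals))   # cross the two sets
grid$de <- grid$de * s2
grid$ie <- grid$ie * s2
grid$rotate <- 0                           # isotropy: rotation 0, scale 1
grid$scale <- 1
grid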

For spatial linear models with random effects, the same approach is used to create a crossing of spatial covariance parameters, and a separate approach is used to create a set of random effect variances. The random effect variances are also first grouped by proportions. In the first combination, the first random effect variance is given 90% of the variance and the remaining 10% is spread evenly among the remaining random effect variances. In the second combination, the second random effect variance is given 90% of the variance and the remaining 10% is spread evenly among the remaining random effect variances, and so on. These combinations ascertain whether one random effect dominates the variability. A final grouping is also considered, in which the variance is spread evenly among all random effects.

When finding parameter values σde2\sigma^2_{de}, σie2\sigma^2_{ie}, and the random effect variances (σui2\sigma^2_{u_i} for the iith random effect), three scenarios are considered. In the first scenario, σde2\sigma^2_{de} and σie2\sigma^2_{ie} get 90% of the inflated sample variance and the random effect variances get 10%. In this scenario, only the random effect grouping where the variance is evenly spread out is considered. This is because the random effect variances are already contributing little to the overall variability, so performing additional objective function evaluations is unnecessary. In the second scenario, the random effects get 90% of the inflated sample variance and σde2\sigma^2_{de} and σie2\sigma^2_{ie} get 10%. Similarly, in this scenario only the σde2\sigma^2_{de} and σie2\sigma^2_{ie} grouping where the variance is evenly spread out is considered. Also in this scenario, only the lowest values of the range and extra parameters are used. In the third scenario, 50% of the inflated sample variance is given to σde2\sigma^2_{de} and σie2\sigma^2_{ie} and 50% to the random effects. In this scenario, the only parameter combination considered is the case where variances are evenly spread out among σde2\sigma^2_{de}, σie2\sigma^2_{ie}, and the random effect variances. Together, there are parameter configurations where the spatial variability dominates (scenario 1), the random variability dominates (scenario 2), and there is an even contribution from spatial and random variability (scenario 3). The parameter configuration that minimizes the objective function is then used as an initial value for optimization. Recall that random effects are only used with restricted maximum likelihood or maximum likelihood estimation, so the objective function is always a likelihood.

Suppose the inflated sample variance is 10, the exponential covariance is used assuming isotropy, the diagonal of the bounding box is 180 distance units, and there are two random effects. The parameter configurations evaluated are shown in the following table.

Grid search parameter configurations for an isotropic exponential spatial covariance with two random effects, inflated sample variance equal to 10, and diagonal of the bounding box equal to 180 distance units.
σde2\sigma^2_{de} σie2\sigma^2_{ie} ϕ\phi α\alpha SS σu12\sigma^2_{u1} σu22\sigma^2_{u2}
8.1 0.9 15 0 1 0.5 0.5
0.9 8.1 15 0 1 0.5 0.5
4.5 4.5 15 0 1 0.5 0.5
8.1 0.9 45 0 1 0.5 0.5
0.9 8.1 45 0 1 0.5 0.5
4.5 4.5 45 0 1 0.5 0.5
0.5 0.5 15 0 1 8.1 0.9
0.5 0.5 15 0 1 0.9 8.1
0.5 0.5 15 0 1 4.5 4.5
2.5 2.5 15 0 1 2.5 2.5
2.5 2.5 45 0 1 2.5 2.5

This grid search approach balances a thorough exploration of the parameter space with computational efficiency, as each objective function evaluation can be computationally expensive.

Hypothesis Testing

The hypothesis tests for 𝛃̂\hat{\boldsymbol{\beta}} returned by summary() or tidy() of an splm or spautor object are asymptotic z-tests based on the normal (Gaussian) distribution (Wald tests). The null hypothesis for the test associated with each β̂i\hat{\beta}_i is that βi=0\beta_i = 0. Then the test statistic is given by z̃=β̂iSE(β̂i),\begin{equation*} \tilde{z} = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}, \end{equation*} where SE(β̂i)\text{SE}(\hat{\beta}_i) is the standard error of β̂i\hat{\beta}_i, which equals the square root of the iith diagonal element of (𝐗𝚺̂1𝐗)1(\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}. The p-value is given by 2*(1Φ(|z̃|))2 * (1 - \Phi(|\tilde{z}|)), which corresponds to an equal-tailed, two-sided hypothesis test of level α\alpha where Φ()\Phi(\cdot) denotes the standard normal (Gaussian) cumulative distribution function and |||\cdot| denotes the absolute value.
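These quantities can be reproduced by hand from a fitted model; a short sketch (assuming a fitted splm or spautor object named mod) is:

est <- coef(mod)                           # fixed effect estimates
se  <- sqrt(diag(vcov(mod)))               # standard errors
z   <- est / se                            # Wald z statistics
p   <- 2 * (1 - pnorm(abs(z)))             # two-sided p-values
cbind(estimate = est, std_error = se, z = z, p_value = p)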

Random Effects (splm() only and "reml" or "ml" estmethod only)

The random effects contribute directly to the covariance through their design matrices. Let 𝐮\mathbf{u} be a mean-zero random effect column vector of length nun_u, where nun_u is the number of levels of the random effect, with design matrix 𝐙u\mathbf{Z}_u. Then Cov(𝐙u𝐮)=𝐙uCov(𝐮)𝐙u\text{Cov}(\mathbf{Z}_u\mathbf{u}) = \mathbf{Z}_u \text{Cov}(\mathbf{u})\mathbf{Z}_u^\top. Because the elements of 𝐮\mathbf{u} are independent of one another with common variance, this reduces to Cov(𝐙u𝐮)=σu2𝐙u𝐙u\text{Cov}(\mathbf{Z}_u\mathbf{u}) = \sigma^2_u \mathbf{Z}_u \mathbf{Z}_u^\top, where σu2\sigma^2_u is the variance parameter corresponding to the random effect (i.e., the random effect variance parameter).

The 𝐙\mathbf{Z} matrices index the levels of the random effect. 𝐙\mathbf{Z} has dimension n×nun \times n_u, where nn is the sample size. Each row of 𝐙\mathbf{Z} corresponds to an observation and each column to a level of the random effect. For example, suppose we have n=4n = 4 observations, so 𝐲={y1,y2,y3,y4}\mathbf{y} = \{y_1, y_2, y_3, y_4\}. Also suppose that the random effect 𝐮\mathbf{u} has two levels and that y1y_1 and y4y_4 are in the first level and y2y_2 and y3y_3 are in the second level. For random intercepts, each element of 𝐙\mathbf{Z} is one if the observation is in the appropriate level of the random effect and zero otherwise. So it follows that 𝐙𝐮=[10010110][u1u2],\begin{equation*} \mathbf{Z}\mathbf{u} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \end{equation*} where u1u_1 and u2u_2 are the random intercepts for the first and second levels of 𝐮\mathbf{u}, respectively. For random slopes, each element of 𝐙\mathbf{Z} equals the value of an auxiliary variable, 𝐤\mathbf{k}, if the observation is in the appropriate level of the random effect and zero otherwise. So if 𝐤={2,7,5,4}\mathbf{k} = \{2, 7, 5, 4 \} it follows that 𝐙𝐮=[20070540][u1u2],\begin{equation*} \mathbf{Z}\mathbf{u} = \begin{bmatrix} 2 & 0 \\ 0 & 7 \\ 0 & 5 \\ 4 & 0 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \end{equation*} where u1u_1 and u2u_2 are the random slopes for the first and second levels of 𝐮\mathbf{u}, respectively. If a random slope is included in the model, it is common for the auxiliary variable to be a column in 𝐗\mathbf{X}, the fixed effects design matrix (i.e., also a fixed effect). Denote this column as 𝐱\mathbf{x}. Here 𝛃\boldsymbol{\beta} captures the average effect of 𝐱\mathbf{x} on 𝐲\mathbf{y} (accounting for other explanatory variables) and 𝐮\mathbf{u} captures a subject-specific effect of 𝐱\mathbf{x} on 𝐲\mathbf{y}. So for a subject in the iith level of 𝐮\mathbf{u}, the average increase in yy associated with a one-unit increase in xx is β+ui\beta + u_i.
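A sketch of constructing these design matrices in R for the four-observation example above is:

level <- factor(c("a", "b", "b", "a"))     # observations 1 and 4 share the first level
k <- c(2, 7, 5, 4)                         # auxiliary variable for random slopes

Z_int   <- model.matrix(~ level - 1)       # random intercept design matrix
Z_slope <- Z_int * k                       # random slope design matrix
Z_int
Z_slope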

The sv-wls and sv-cl estimation methods do not use a likelihood, and thus, they do not allow for the estimation of random effects in spmodel.

Anisotropy (splm() only)

An isotropic spatial covariance function behaves similarly in all directions (i.e., is independent of direction) as a function of distance. An anisotropic spatial covariance function does not behave similarly in all directions as a function of distance. The following figure shows ellipses for an isotropic and anisotropic spatial covariance function centered at the origin (a distance of zero). The black outline of each ellipse is a level curve of equal correlation. The left ellipse (a circle) represents an isotropic covariance function. The distance at which the correlation between two observations lies on the level curve is the same in all directions. The right ellipse represents an anisotropic covariance function. The distance at which the correlation between two observations lies on the level curve differs with direction.

In the left figure, the ellipse of an isotropic spatial covariance function centered at the origin is shown. In the right figure, the ellipse of an anisotropic spatial covariance function centered at the origin is shown. The black outline of each ellipse is a level curve of equal correlation.

To accommodate spatial anisotropy, the original coordinates must be transformed such that the transformed coordinates yield an isotropic spatial covariance. This transformation involves a rotation and a scaling. Consider a set of xx and yy coordinates that should be transformed into x*x^* and y*y^* coordinates. This transformation is formally defined as [x*y*]=[1001/S][cos(α)sin(α)sin(α)cos(α)][xy].\begin{equation*} \begin{bmatrix} x^* \\ y^* \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 / S \end{bmatrix} \begin{bmatrix} \cos(\alpha) & \sin(\alpha) \\ -\sin(\alpha) & \cos(\alpha) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}. \end{equation*} The original coordinates are first multiplied by the rotation matrix, which rotates the coordinates clockwise by angle α\alpha. They are then multiplied by the scaling matrix, which scales the minor axis of the spatial covariance ellipse by the reciprocal of SS. The transformed coordinates are then used to compute distances and the resulting spatial covariances. This type of anisotropy is more formally known as “geometric” anisotropy because it involves a geometric transformation of the coordinates. The following figure shows this process step-by-step.
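A sketch of this coordinate transformation in R, mirroring the rotation-then-scaling above, is:

transform_coords <- function(coords, rotate, scale) {
  # rotate coordinates clockwise by `rotate`, then scale the minor axis by 1 / scale
  rot <- matrix(c(cos(rotate),  sin(rotate),
                  -sin(rotate), cos(rotate)), nrow = 2, byrow = TRUE)
  scl <- diag(c(1, 1 / scale))
  t(scl %*% rot %*% t(coords))             # coords is an n x 2 matrix of (x, y)
}

coords <- cbind(x = c(0, 1, 2), y = c(0, 1, 0))
transform_coords(coords, rotate = pi / 4, scale = 0.5)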

In the left figure, the ellipse of an anisotropic spatial covariance function centered at the origin is shown. The blue lines represent the original axes and the red lines the transformed axes. The solid lines represent the x-axes and the dotted lines the y-axes. Note that the solid, red line is the major axis of the ellipse and the dashed, red line is the minor axis of the ellipse. In the center figure, the ellipse has been rotated clockwise by the rotate parameter so the major axis is the transformed x-axis and the minor axis is the transformed y-axis. In the right figure, the minor axis of the ellipse has been scaled by the reciprocal of the scale parameter so that the ellipse becomes a circle, which corresponds to an isotropic spatial covariance function. The transformed coordinates are then used to compute distances and spatial covariances.

Anisotropy parameters (α\alpha and SS) can be estimated in spmodel using restricted maximum likelihood or maximum likelihood. Estimating anisotropy can be challenging. First, we need to restrict the parameter space so that the two parameters are identifiable (there is a unique parameter set for each possible outcome). We restricted α\alpha to [0,π][0, \pi] radians due to symmetry of the covariance ellipse at rotations α\alpha and α+jπ\alpha + j \pi, where jj is any integer. We also restricted SS to [0,1][0, 1] because we have defined SS as the scaling factor for the length of the minor axis relative to the major axis – otherwise it would not be clear whether SS refers to the minor or major axis. Given this restricted parameter space, local maxima remain an issue: a rotation slightly greater than zero yields a covariance ellipse nearly identical to one with a rotation slightly less than π\pi, yet zero and π\pi are far apart in the parameter space. To address the local maxima problem, each optimization iteration actually involves two likelihood evaluations – one for α\alpha and another for |πα||\pi - \alpha|, where |||\cdot| denotes absolute value. Thus one likelihood evaluation is always in [0,π/2][0, \pi/2] radians and another in [π/2,π][\pi/2, \pi] radians, exploring different quadrants of the parameter space and allowing optimization to test solutions near zero and π\pi simultaneously.

Anisotropy parameters cannot be estimated in spmodel when estmethod is sv-wls or sv-cl. However, known anisotropy parameters for these estimation methods can be specified via spcov_initial and incorporated into estimation of 𝛉\boldsymbol{\theta} and 𝛃\boldsymbol{\beta}. Anisotropy is not defined for areal data given its (binary) neighborhood structure.

Partition Factors

A partition factor is a factor (or categorical) variable in which observations from different levels of the partition factor are assumed uncorrelated. A partition matrix 𝐏\mathbf{P} of dimension n×nn \times n can be constructed to represent the partition factor. The ijijth element of 𝐏\mathbf{P} equals one if the observations in the iith row and jjth column are from the same level of the partition factor and zero otherwise. Then the initial covariance matrix (ignoring the partition factor) is updated by taking the Hadamard (element-wise) product with the partition matrix: 𝚺updated=𝚺initial𝐏,\begin{equation*} \boldsymbol{\Sigma}_{updated} = \boldsymbol{\Sigma}_{initial} \odot \mathbf{P}, \end{equation*} where \odot indicates the Hadamard product. Partition factors impose a block structure in 𝚺\boldsymbol{\Sigma}, which allows for efficient computation of 𝚺1\boldsymbol{\Sigma}^{-1} used for estimation and prediction.
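A sketch of building the partition matrix and applying the Hadamard product for a small example is:

part <- c("A", "A", "B", "B")              # partition factor levels
P <- outer(part, part, "==") * 1           # 1 if same level, 0 otherwise
D <- as.matrix(dist(cbind(runif(4), runif(4))))
Sigma_initial <- 2 * exp(-D / 0.5) + 0.1 * diag(4)
Sigma_updated <- Sigma_initial * P         # Hadamard (element-wise) product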

When computing the empirical semivariogram using esv(), semivariances are ignored for pairs of observations from different levels of the partition factor. The same rule applies to the sv-wls and sv-cl estimation methods.

Big Data (splm() only)

Big data model-fitting is accommodated in spmodel using a “local” spatial indexing (SPIN) approach (Ver Hoef, Dumelle, et al. 2023). Suppose there are mm unique indexes, and each observation has one of the indexes. Then 𝚺\boldsymbol{\Sigma} can be represented blockwise as 𝚺=[𝚺1,1𝚺1,2𝚺1,m𝚺2,1𝚺2,2𝚺2,3𝚺2,m𝚺3,2𝚺3,4𝚺4,3𝚺m,1𝚺m,m],\begin{equation}\label{eq:full_cov} \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{1,1} & \boldsymbol{\Sigma}_{1,2} & \ldots & \ldots & \boldsymbol{\Sigma}_{1,m} \\ \boldsymbol{\Sigma}_{2,1} & \boldsymbol{\Sigma}_{2,2} & \boldsymbol{\Sigma}_{2,3} & \ldots & \boldsymbol{\Sigma}_{2,m} \\ \vdots & \boldsymbol{\Sigma}_{3,2} & \ddots & \boldsymbol{\Sigma}_{3,4} & \vdots \\ \vdots & \vdots & \boldsymbol{\Sigma}_{4,3} & \ddots & \vdots \\ \boldsymbol{\Sigma}_{m,1} & \ldots & \ldots & \ldots & \boldsymbol{\Sigma}_{m, m} \end{bmatrix}, \end{equation} To perform estimation for big data, observations with the same index value are assumed independent of observations with different index values, yielding a “big-data” covariance matrix given by 𝚺bd=[𝚺1,1𝟎𝟎𝟎𝚺2,2𝟎𝟎𝟎𝟎𝟎𝟎𝚺m,m],\begin{equation}\label{eq:bd_cov} \boldsymbol{\Sigma}_{bd} = \begin{bmatrix} \boldsymbol{\Sigma}_{1,1} & \boldsymbol{0} & \ldots & \ldots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{\Sigma}_{2,2} & \boldsymbol{0} & \ldots & \boldsymbol{0} \\ \vdots & \boldsymbol{0} & \ddots & \boldsymbol{0} & \vdots \\ \vdots & \vdots & \boldsymbol{0} & \ddots & \vdots \\ \boldsymbol{0} & \ldots & \ldots & \ldots & \boldsymbol{\Sigma}_{m, m} \end{bmatrix}, \end{equation} Estimation then proceeds using 𝚺bd\boldsymbol{\Sigma}_{bd} instead of 𝚺\boldsymbol{\Sigma}. When computing the empirical semivariogram, semivariances are ignored when observations have different local indexes. For the sv-wls and sv-cl estimation methods, semivariances are ignored when observations have different local indexes. Via this equation, it can be seen that the local index acts as a partition factor separate from the partition factor explicitly defined by partition_factor.

spmodel allows for custom local indexes to be passed to splm(). If a custom local index is not passed, the local index is determined using the "random" or "kmeans" method. The "random" method assigns observations to indexes randomly based on the number of groups desired. The "kmeans" method uses k-means clustering (MacQueen 1967) on the x-coordinates and y-coordinates to assign observations to indexes (based on the number of clusters (groups) desired).

The estimate of 𝛃\boldsymbol{\beta} when using 𝚺bd\boldsymbol{\Sigma}_{bd} is given by 𝛃̂bd=(𝐗𝚺̂bd1𝐗)1𝐗𝚺̂bd1𝐲=𝐓xx1𝐭xy,\begin{equation}\label{eq:beta_bd} \hat{\boldsymbol{\beta}}_{bd} = (\mathbf{X}^\top \boldsymbol{\hat{\Sigma}}^{-1}_{bd}\mathbf{X})^{-1}\mathbf{X}^\top \boldsymbol{\hat{\Sigma}}^{-1}_{bd} \mathbf{y} = \mathbf{T}^{-1}_{xx}\mathbf{t}_{xy}, \end{equation} where 𝐓xx=i=1m𝐗i𝚺̂i,i1𝐗i\mathbf{T}_{xx} = \sum_{i = 1}^m \mathbf{X}_i^\top \boldsymbol{\hat{\Sigma}}^{-1}_{i, i}\mathbf{X}_i and 𝐭xy=i=1m𝐗i𝚺̂i,i1𝐲i\mathbf{t}_{xy} = \sum_{i = 1}^m \mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{y}_i. Note that in 𝛃̂bd\hat{\boldsymbol{\beta}}_{bd}, 𝐗i\mathbf{X}_i and 𝐲i\mathbf{y}_i are the subsets of 𝐗\mathbf{X} and 𝐲\mathbf{y}, respectively, for the iith local index. This estimator acts as a pooled estimator of 𝛃\boldsymbol{\beta} across the indexes.
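The block-wise computation of the big data estimate of 𝛃 can be sketched directly. Assuming a list of per-index design matrices X_list, responses y_list, and estimated covariance blocks Sigma_list (names of our choosing), a minimal version is:

p <- ncol(X_list[[1]])
Txx <- matrix(0, p, p)                     # accumulates t(X_i) Sigma_ii^{-1} X_i
txy <- matrix(0, p, 1)                     # accumulates t(X_i) Sigma_ii^{-1} y_i
for (i in seq_along(X_list)) {
  Si_inv <- solve(Sigma_list[[i]])
  Txx <- Txx + t(X_list[[i]]) %*% Si_inv %*% X_list[[i]]
  txy <- txy + t(X_list[[i]]) %*% Si_inv %*% y_list[[i]]
}
beta_bd <- solve(Txx, txy)                 # pooled estimate of beta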

spmodel has four approaches for estimating the covariance matrix of 𝛃̂bd\hat{\boldsymbol{\beta}}_{bd}. The choice is determined by the var_adjust argument to local. The first approach implements no adjustment (var_adjust = "none") and simply uses 𝐓xx1\mathbf{T}_{xx}^{-1}, which is the covariance matrix of 𝛃̂bd\hat{\boldsymbol{\beta}}_{bd} using 𝚺bd\boldsymbol{\Sigma}_{bd}. While computationally efficient, this approach ignores the covariance across indexes. It can be shown that the covariance of 𝛃̂bd\hat{\boldsymbol{\beta}}_{bd} using 𝚺\boldsymbol{\Sigma}, the full covariance matrix, is given by 𝐓xx1+𝐓xx1𝐖xx𝐓xx1,\begin{equation}\label{eq:var_theo} \mathbf{T}_{xx}^{-1} + \mathbf{T}_{xx}^{-1} \mathbf{W}_{xx}\mathbf{T}_{xx}^{-1}, \end{equation} where 𝐖xx=i=1m1j=i+1m(𝐗i𝚺̂i,i1𝚺̂i,j𝚺̂j,j1𝐗j)+(𝐗i𝚺̂i,i1𝚺̂i,j𝚺̂j,j1𝐗j)\begin{equation*} \mathbf{W}_{xx} = \sum_{i = 1}^{m - 1} \sum_{j = i + 1}^m (\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \hat{\boldsymbol{\Sigma}}_{i, j} \hat{\boldsymbol{\Sigma}}^{-1}_{j, j} \mathbf{X}_j) + (\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \hat{\boldsymbol{\Sigma}}_{i, j} \hat{\boldsymbol{\Sigma}}^{-1}_{j, j} \mathbf{X}_j)^\top \end{equation*} This equation can be viewed as the sum of the unadjusted covariance matrix of 𝛃̂bd\hat{\boldsymbol{\beta}}_{bd} (𝐓xx1\mathbf{T}_{xx}^{-1}) and a correction that incorporates the covariance across indexes (𝐓xx1𝐖xx𝐓xx1\mathbf{T}_{xx}^{-1} \mathbf{W}_{xx}\mathbf{T}_{xx}^{-1}). This adjustment is known as the “theoretically-correct” (var_adjust = "theoretical") adjustment because it uses 𝚺\boldsymbol{\Sigma}. The theoretical adjustment is the default adjustment in spmodel, but it is the most computationally expensive adjustment. Two alternative adjustments are also provided, and while not equal to the theoretical adjustment, they are easier to compute. They are the empirical (var_adjust = "empirical") and pooled (var_adjust = "pooled") adjustments. The empirical adjustment is given by 1m(m1)i=1m(𝛃̂i𝛃̂bd)(𝛃̂i𝛃̂bd),\begin{equation*} \frac{1}{m(m -1)} \sum_{i = 1}^m (\boldsymbol{\hat{\beta}}_i - \boldsymbol{\hat{\beta}}_{bd})(\boldsymbol{\hat{\beta}}_i - \boldsymbol{\hat{\beta}}_{bd})^\top, \end{equation*} where 𝛃̂i=(𝐗𝚺̂1𝐗)1𝐗i𝚺̂i,i1𝐲i\boldsymbol{\hat{\beta}}_i = (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{y}_i. A similar adjustment could use 𝛃̂i=(𝐗i𝚺̂i,i1𝐗i)1𝐗i𝚺̂i,i1𝐲i\boldsymbol{\hat{\beta}}_i = (\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{X}_i)^{-1}\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{y}_i, which more closely resembles a composite likelihood approach. This approach is sensitive to the presence of at least one singularity in 𝐗i𝚺̂i,i1𝐗i\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{X}_i, in which case the variance adjustment cannot be computed. The "pooled" variance adjustment is given by 1m2i=1m(𝐗i𝚺̂i,i1𝐗i)1.\begin{equation*} \frac{1}{m^2} \sum_{i = 1}^m (\mathbf{X}^\top_i \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{X}_i)^{-1}. \end{equation*} Note that the pooled variance adjustment cannot be computed if any 𝐗i𝚺̂i,i1𝐗i\mathbf{X}_i^\top \hat{\boldsymbol{\Sigma}}^{-1}_{i, i} \mathbf{X}_i are singular.

splmRF() and spautorRF()

splmRF() and spautorRF() fit random forest spatial residual models designed for prediction. These models are fit by combining aspects of random forest and spatial linear modeling. First, a random forest model (Breiman 2001; James et al. 2013) is fit using the ranger R package (Wright and Ziegler 2015). Then random forest fitted values are obtained for each data observation and used to compute a residual (by subtracting the fitted value from the observed value). Then an intercept-only spatial linear model is fit to these residuals: 𝐞rf=β0+𝛕+𝛜,\begin{equation*} \mathbf{e}_{rf} = \beta_0 + \mathbf{\tau} + \mathbf{\epsilon}, \end{equation*} where 𝐞rf\mathbf{e}_{rf} are the random forest residuals. Random forest spatial residual models can significantly improve predictive accuracy for new data compared to standard random forest models by formally incorporating spatial covariance in the random forest residuals (Fox, Ver Hoef, and Olsen 2020).

Different estimation methods, different spatial covariance functions, fixing spatial covariance parameter values, random effects, anisotropy, partition factors, and big data are accommodated in the spatial linear model portion of the random forest spatial residual models by supplying their respective named arguments to splmRF() and spautorRF().

sprnorm()

Spatial normal (Gaussian) random variables are simulated by taking the sum of a fixed mean and random errors. The random errors have mean zero and covariance matrix 𝚺\boldsymbol{\Sigma}. A realization of the random errors is obtained from 𝚺1/2𝐞\boldsymbol{\Sigma}^{1/2} \mathbf{e}, where 𝐞\mathbf{e} is a normal random variable with mean zero and covariance matrix 𝐈\mathbf{I}. Then the spatial normal random variable equals 𝐲=𝛍+𝚺1/2𝐞,\begin{equation*} \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2} \mathbf{e}, \end{equation*} where 𝛍\boldsymbol{\mu} is the fixed mean. It follows that E(𝐲)=𝛍+𝚺1/2E(𝐞)=𝛍Cov(𝐲)=Cov(𝚺1/2𝐞)=𝚺1/2Cov(𝐞)𝚺1/2=𝚺1/2𝚺1/2=𝚺\begin{equation*} \begin{split} \text{E}(\mathbf{y}) & = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2} \text{E}(\mathbf{e}) = \boldsymbol{\mu} \\ \text{Cov}(\mathbf{y}) & = \text{Cov}(\boldsymbol{\Sigma}^{1/2} \mathbf{e}) = \boldsymbol{\Sigma}^{1/2} \text{Cov}(\mathbf{e}) \boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma}^{1/2} \boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma} \end{split} \end{equation*}
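A sketch of this simulation using a Cholesky factor as the square root (one valid choice of square root, giving the same covariance as the symmetric square root) is:

set.seed(1)
n <- 50
coords <- cbind(runif(n), runif(n))
D <- as.matrix(dist(coords))
Sigma <- 2 * exp(-D / 0.3) + 0.1 * diag(n) # exponential covariance plus a nugget
L <- t(chol(Sigma))                        # lower-triangular square root, L %*% t(L) = Sigma
mu <- rep(1, n)
y <- mu + L %*% rnorm(n)                   # spatial normal realization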

vcov()

vcov() returns the variance-covariance matrix of estimated parameters. Currently, vcov() only returns the variance-covariance matrix of 𝛃̂\hat{\boldsymbol{\beta}}, the fixed effects. The variance-covariance matrix of the fixed effects is given by (𝐗𝚺̂1𝐗)1(\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1}.

Spatial Generalized Linear Models

When building spatial linear models, the response vector 𝐲\mathbf{y} is typically assumed Gaussian (given 𝐗\mathbf{X}). Relaxing this assumption on the distribution of 𝐲\mathbf{y} yields a rich class of spatial generalized linear models that can describe binary data, proportion data, count data, and skewed data. These models are parameterized as g(𝛍)=𝛈=𝐗𝛃+𝛕+𝛜,\begin{equation}\label{eq:spglm} g(\boldsymbol{\mu}) = \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \end{equation} where g()g(\cdot) is called a link function, 𝛍\boldsymbol{\mu} is the mean of 𝐲\mathbf{y}, and the remaining terms 𝐗\mathbf{X}, 𝛃\boldsymbol{\beta}, 𝛕\boldsymbol{\tau}, 𝛜\boldsymbol{\epsilon} represent the same quantities as for the spatial linear models. The link function, g()g(\cdot), “links” a function of 𝛍\boldsymbol{\mu} to the linear term 𝛈\boldsymbol{\eta}, denoted here as 𝐗𝛃+𝛕+𝛜\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, which is familiar from spatial linear models. Note that the linking of 𝛍\boldsymbol{\mu} to 𝛈\boldsymbol{\eta} applies element-wise to each vector. Each link function g()g(\cdot) has a corresponding inverse link function, g1()g^{-1}(\cdot). The inverse link function “links” a function of 𝛈\boldsymbol{\eta} to 𝛍\boldsymbol{\mu}. Notice that for spatial generalized linear models, we are not modeling 𝐲\mathbf{y} directly as we do for spatial linear models, but rather we are modeling a function of the mean of 𝐲\mathbf{y}. Also notice that 𝛈\boldsymbol{\eta} is unconstrained but 𝛍\boldsymbol{\mu} is usually constrained in some way (e.g., positive).

The model g(𝛍)=𝛈=𝐗𝛃+𝛕+𝛜g(\boldsymbol{\mu}) = \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon} is called the spatial generalized linear model. spmodel allows fitting of spatial generalized linear models when 𝐲\mathbf{y} is a binomial (or Bernoulli), beta, Poisson, negative binomial, gamma, or inverse Gaussian random vector via the Laplace approximation and restricted maximum likelihood estimation or maximum likelihood estimation – Ver Hoef, Blagg, et al. (2023) provide further details. For binomial and beta 𝐲\mathbf{y}, the logit link function is defined as g(𝛍)=ln(𝛍1𝛍)=𝛈g(\boldsymbol{\mu}) = \ln(\frac{\boldsymbol{\mu}}{1 - \boldsymbol{\mu}}) = \boldsymbol{\eta}, and the inverse logit link function is defined as g1(𝛈)=exp(𝛈)1+exp(𝛈)=𝛍g^{-1}(\boldsymbol{\eta}) = \frac{\exp(\boldsymbol{\eta})}{1 + \exp(\boldsymbol{\eta})} = \boldsymbol{\mu}. For Poisson, negative binomial, gamma, and inverse Gaussian 𝐲\mathbf{y}, the log link function is defined as g(𝛍)=log(𝛍)=𝛈g(\boldsymbol{\mu}) = log(\boldsymbol{\mu}) = \boldsymbol{\eta}, and the inverse log link function is defined as g1(𝛈)=exp(𝛈)=𝛍g^{-1}(\boldsymbol{\eta}) = \exp(\boldsymbol{\eta}) = \boldsymbol{\mu}. Full parameterizations of these distributions are given later.
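These link functions correspond to base R's qlogis()/plogis() and log()/exp(); a brief sketch is:

mu  <- c(0.2, 0.5, 0.9)
eta <- qlogis(mu)                          # logit link: log(mu / (1 - mu))
plogis(eta)                                # inverse logit: exp(eta) / (1 + exp(eta)), recovers mu

lambda <- c(0.5, 2, 10)
eta2 <- log(lambda)                        # log link
exp(eta2)                                  # inverse log link, recovers lambda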

For spatial linear models, one can marginalize over 𝛃\boldsymbol{\beta} and the random components to obtain an explicit distribution of only the data (𝐲\mathbf{y}) and covariance parameters (𝛉\boldsymbol{\theta}) – this is the REML likelihood. For spatial generalized linear models, this marginalization is more challenging. First define 𝐰=𝐗𝛃+𝛕+𝛜\mathbf{w} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}. Our goal is to marginalize the joint distribution of 𝐲\mathbf{y} and 𝐰\mathbf{w} over 𝐰\mathbf{w} (and 𝛃\boldsymbol{\beta}) to obtain a distribution of only the data (𝐲\mathbf{y}), a dispersion parameter (φ\varphi), and covariance parameters (𝛉\boldsymbol{\theta}). To accomplish this feat, we use a hierarchical construction that treats the 𝐰\mathbf{w} as latent (i.e., unobserved) variables and then use the Laplace approximation to perform integration. We briefly describe this approach next, but it is described in full detail in Ver Hoef, Blagg, et al. (2023).

The marginal distribution of interest can be written hierarchically as [𝐲|𝐗,φ,𝛉]=𝐰[𝐲|g1(𝐰),φ][𝐰|𝐗,𝛉]d𝐰.\begin{equation*} [\mathbf{y}|\mathbf{X}, \varphi, \boldsymbol{\theta}] = \int_\mathbf{w} [\mathbf{y} | g^{-1}(\mathbf{w}), \varphi] [\mathbf{w} | \mathbf{X}, \boldsymbol{\theta}] d\mathbf{w} . \end{equation*} The term [𝐲|g1(𝐰),φ][\mathbf{y}|g^{-1}(\mathbf{w}), \varphi] is the likelihood of the generalized linear model with mean function g1(𝐰)g^{-1}(\mathbf{w}), and the term [𝐰|𝐗,𝛉][\mathbf{w}|\mathbf{X}, \boldsymbol{\theta}] is the restricted likelihood of 𝐰\mathbf{w} given the covariance parameters.

Next define 𝐰=ln([𝐲|g1(𝐰),φ][𝐰|𝐗,𝛉])\ell_\mathbf{w} = \ln([\mathbf{y} | g^{-1}(\mathbf{w}), \varphi] [\mathbf{w} | \mathbf{X}, \boldsymbol{\theta}]). Let 𝐠\mathbf{g} be the gradient vector where gi=𝐰wig_i = \frac{\partial \ell_\mathbf{w}}{\partial w_i} and 𝐆\mathbf{G} be the Hessian matrix with ijijth element Gi,j=2𝐰wiwjG_{i, j} = \frac{\partial^2 \ell_\mathbf{w}}{\partial w_i \partial w_j}. Using a multivariate Taylor series expansion around some point 𝐚\mathbf{a}, 𝐰exp(𝐰)d𝐰𝐰exp(𝐚+𝐠(𝐰𝐚)+12(𝐰𝐚)𝐆(𝐰𝐚))d𝐰.\begin{equation*} \int_\mathbf{w} \exp(\ell_\mathbf{w}) d\mathbf{w} \approx \int_\mathbf{w} \exp(\ell_\mathbf{a} + \mathbf{g}^\top(\mathbf{w} - \mathbf{a}) + \frac{1}{2}(\mathbf{w} - \mathbf{a})^\top \mathbf{G} (\mathbf{w} - \mathbf{a})) d\mathbf{w}. \end{equation*} If 𝐚\mathbf{a} is the value at which 𝐠=𝟎\mathbf{g} = \mathbf{0}, then 𝐰exp(𝐰)d𝐰exp(𝐚)𝐰exp(12(𝐰𝐚)(𝐆)(𝐰𝐚))d𝐰=exp(𝐚)(2π)n/2|𝐆a|1/2,\begin{equation*} \int_\mathbf{w} \exp(\ell_\mathbf{w}) d\mathbf{w} \approx \exp(\ell_\mathbf{a}) \int_\mathbf{w} \exp \left(-\frac{1}{2} (\mathbf{w} - \mathbf{a})^\top (- \mathbf{G})(\mathbf{w} - \mathbf{a}) \right) d\mathbf{w} = \exp(\ell_\mathbf{a}) (2 \pi)^{n/2} |-\mathbf{G}_a|^{-1/2}, \end{equation*} where 𝐆a\mathbf{G}_a is the Hessian evaluated at 𝐚\mathbf{a} and |||\cdot| is the determinant operator. The previous result follows from the normalizing constant of a Gaussian distribution with kernel exp(12(𝐰𝐚)[(𝐆)1]1(𝐰𝐚))\exp(\frac{1}{2} (\mathbf{w} - \mathbf{a})^\top [(- \mathbf{G})^{-1}]^{-1}(\mathbf{w} - \mathbf{a})). Finally, we arrive at 𝐰exp(𝐰)d𝐰[𝐲|g1(𝐚),φ][𝐚|𝐗,𝛉](2π)n/2|𝐆a|1/2|,\begin{equation*} \int_\mathbf{w} \exp(\ell_\mathbf{w}) d\mathbf{w} \approx [\mathbf{y} | g^{-1}(\mathbf{a}), \varphi] [\mathbf{a} | \mathbf{X}, \boldsymbol{\theta}] (2 \pi)^{n/2} |-\mathbf{G}_a|^{-1/2}|, \end{equation*} which is a distribution that has been marginalized over the latent 𝐰\mathbf{w} and depends only on the data, a dispersion parameter, and the covariance parameters. The approach we outlined to solve this integral is known as the Laplace approximation.
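A one-dimensional numeric illustration of the Laplace approximation (a toy integrand combining a Poisson likelihood with a Gaussian density in w, compared against numerical quadrature) is sketched below; the toy quantities are our own and are not spmodel's internals.

# toy log-integrand: Poisson log-likelihood in w plus a Gaussian log-density
ell <- function(w, y = 3) dpois(y, exp(w), log = TRUE) + dnorm(w, 0, 1, log = TRUE)

a <- optimize(ell, interval = c(-10, 10), maximum = TRUE)$maximum  # mode of ell
h <- 1e-4
G <- (ell(a + h) - 2 * ell(a) + ell(a - h)) / h^2                  # second derivative at the mode
laplace <- exp(ell(a)) * (2 * pi)^(1 / 2) * (-G)^(-1 / 2)          # Laplace approximation (n = 1)
exact <- integrate(function(w) exp(ell(w)), -Inf, Inf)$value       # numerical quadrature
c(laplace = laplace, quadrature = exact)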

AIC() and AICc()

AIC() and AICc() for spatial generalized linear models are defined the same as for spatial linear models.

anova()

anova() for spatial generalized linear models is defined the same as for spatial linear models.

AUROC()

AUROC() computes the area under the receiver operating characteristic curve (AUROC) and is relevant for binomial (i.e., logistic) regression models where each response represents a single success or failure (i.e., is binary). The AUROC ranges from zero to one and is a measure of the model’s classification accuracy averaged over all possible threshold values. More formally, it represents the probability that a randomly chosen success (datum value of one) has a larger fitted value than the fitted value of a randomly chosen failure (datum value of zero), with an adjustment for ties in the fitted values (Muschelli III 2020). AUROC() in spmodel leverages the auc() function in pROC (Robin et al. 2011). For more on the AUROC, see Hanley and McNeil (1982) and Fawcett (2006).

BIC()

BIC() for spatial generalized linear models is defined the same as for spatial linear models.

coef()

coef() for spatial generalized linear models is defined the same as for spatial linear models.

confint()

confint() for spatial generalized linear models is defined the same as for spatial linear models.

cooks.distance()

The Cook’s distance is defined as the standard generalized linear model Cook’s distance after conditioning on 𝐰\mathbf{w}. That is, after conditioning on 𝐰\mathbf{w}, the Cook’s distance is 𝐞p2pdiag(𝐇c)11diag(𝐇c),\begin{equation} \frac{\mathbf{e}_p^2}{p} \odot diag(\mathbf{H}_c) \odot \frac{1}{1 - diag(\mathbf{H}_c)}, \end{equation} where 𝐞p2\mathbf{e}_p^2 are the squared Pearson residuals, diag(𝐇c)diag(\mathbf{H}_c) is the diagonal of the leverage (hat) matrix conditional on 𝐰\mathbf{w} and given by 𝐇c=𝐗v(𝐗v𝐗v)1𝐗v\mathbf{H}_c = \mathbf{X}_v (\mathbf{X}_v^\top\mathbf{X}_v)^{-1} \mathbf{X}_v^\top, where 𝐗v=𝐕1/2𝐗\mathbf{X}_v = \mathbf{V}^{1/2}\mathbf{X} and 𝐕\mathbf{V} is a diagonal matrix with iith diagonal element equal to Var(wi)\text{Var}(w_i), and \odot denotes the Hadamard (element-wise) product.

deviance()

The deviance is defined conditional on 𝐰\mathbf{w} (i.e., we find the deviance of the conditional model). It is derived by taking the deviance definition from each generalized linear model distribution (defined later) and evaluating μi\mu_i at wiw_i.

fitted()

The fitted values on the link scale (type = "link") are given by 𝐰\mathbf{w}. The fitted values on the response scale (type = "response") are given by g1(𝐰)g^{-1}(\mathbf{w}). The fitted values for the spatial random errors (type = "spcov") and random effects (type = "randcov") are derived similarly as for spatial linear models but treat 𝐰\mathbf{w} as the response instead of 𝐲\mathbf{y} (and are on the link scale).

hatvalues()

The leverage (hat) matrix is obtained by finding the standard generalized linear model leverage (hat) matrix after conditioning on 𝐰\mathbf{w}. That is, after conditioning on 𝐰\mathbf{w} the leverage (hat) matrix, 𝐇c\mathbf{H}_c, is 𝐇c=𝐗v(𝐗v𝐗v)1𝐗v,\begin{equation} \mathbf{H}_c = \mathbf{X}_v (\mathbf{X}_v^\top\mathbf{X}_v)^{-1} \mathbf{X}_v^\top, \end{equation} where 𝐗v=𝐕1/2𝐗\mathbf{X}_v = \mathbf{V}^{1/2}\mathbf{X} and 𝐕\mathbf{V} is a diagonal matrix with iith diagonal element equal to Var(wi)\text{Var}(\text{w}_i).

logLik()

logLik() for spatial generalized linear models is defined the same as for spatial linear models, in that the log-likelihood is returned. The log-likelihood for spatial generalized linear models is given by p(φ,𝛉)=ln([𝐲|g1(𝐚),φ])+ln([𝐚|𝐗,𝛉])+n2ln(2π)12ln(|𝐆𝐚|).\begin{equation*} \ell_p(\varphi, \boldsymbol{\theta}) = \ln([\mathbf{y} | g^{-1}(\mathbf{a}), \varphi]) + \ln([\mathbf{a} | \mathbf{X}, \boldsymbol{\theta}]) + \frac{n}{2} \ln(2 \pi) - \frac{1}{2} \ln(|- \mathbf{G}_{\mathbf{a}}|). \end{equation*}

loocv()

loocv() for spatial generalized linear models is defined similarly to loocv() for spatial linear models, except that 𝐰\mathbf{w} is predicted instead of 𝐲\mathbf{y}. Then g1(𝐰)g^{-1}(\mathbf{w}) is compared to 𝐲\mathbf{y} to compute the mean squared prediction error. That is, mspe=1ni=1n(yig1(wi))2.\begin{equation*} mspe = \frac{1}{n}\sum_{i = 1}^n(y_i - g^{-1}(w_i))^2. \end{equation*}

When cv_fitted = TRUE, the predictions of held-out 𝐰\mathbf{w} are returned. The standard errors of these predictions (of held-out 𝐰\mathbf{w}) are returned when se.fit = TRUE. Note that 𝐆i\mathbf{G}_{-i} is determined from 𝐆\mathbf{G} using Helmert-Wolf blocking as is done for 𝚺i\boldsymbol{\Sigma}_{-i} and 𝚺\boldsymbol{\Sigma}.

Big Data

loocv() for big data spatial generalized linear models is defined similarly to loocv() for big data spatial linear models, except that 𝐰\mathbf{w} is predicted instead of 𝐲\mathbf{y}. Additionally, 𝐆l,l\mathbf{G}_{l, l} is determined from each local neighborhood as is done for 𝚺l,l\boldsymbol{\Sigma}_{l, l}.

predict()

interval = "none"

Building from the previously defined empirical best linear unbiased predictions (i.e., empirical Kriging predictions), predictions of 𝐰u\mathbf{w}_u are given on the link scale (type = "link") by 𝐰̇u=𝐗u𝛃̂+𝚺̂uo𝚺̂o1(𝐰̂o𝐗o𝛃̂).\begin{equation}\label{eq:glm_blup} \mathbf{\dot{w}}_u = \mathbf{X}_u \hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_{o} (\hat{\mathbf{w}}_o - \mathbf{X}_o \hat{\boldsymbol{\beta}}) . \end{equation} These predictions are given on the response (inverse link) scale (type = "response") as g1(𝐰̇u)g^{-1}(\mathbf{\dot{w}}_u).

Similar to the covariance matrix of 𝛃̂\hat{\boldsymbol{\beta}}, the covariance matrix of 𝐰̇u\mathbf{\dot{w}}_u requires an adjustment to account for the fact that the 𝐰\mathbf{w} are not actually observed. First let 𝚲=𝐗u𝐁+𝚺̂uo𝚺̂o1+𝚺̂uo𝚺̂o1𝐗o𝐁\boldsymbol{\Lambda} = \mathbf{X}_u \mathbf{B} + \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}_o^{-1} + \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o \mathbf{B} and 𝐁=(𝐗𝚺̂o1𝐗)1𝐗𝚺̂o1\mathbf{B} = (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\boldsymbol{\Sigma}}_o^{-1}, and note that 𝐰̇u=Λ𝐰̂o\mathbf{\dot{w}}_u = \Lambda \hat{\mathbf{w}}_o. Using the law of conditional variance and conditioning on 𝐰o\mathbf{w}_o as if we had observed them, it follows that Cov(𝐰̂u𝐰u)=Cov(Λ𝐰̂o𝐰u)=E𝐰o[Cov(Λ𝐰̂o𝐰u|𝐰o)]+Cov𝐰o[E(Λ𝐰̂o𝐰u|𝐰o)],\begin{equation*} \text{Cov}(\hat{\mathbf{w}}_u - \mathbf{w}_u) = \text{Cov}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u) = \text{E}_{\mathbf{w}_o}[\text{Cov}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u | \mathbf{w}_o)] + \text{Cov}_{\mathbf{w}_o}[\text{E}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u | \mathbf{w}_o)], \end{equation*} We assume 𝐰̂o\hat{\mathbf{w}}_o is unbiased for 𝐰o\mathbf{w}_o (i.e., E(𝐰̂o|𝐰o)=𝐰o\text{E}(\hat{\mathbf{w}}_o | \mathbf{w}_o) = \mathbf{w}_o). Then Cov𝐰o[E(Λ𝐰̂o𝐰u|𝐰o)]=Cov𝐰o(Λ𝐰o𝐰u)=𝚺̂u𝚺̂uo𝚺̂o1𝚺̂uo+𝐐(𝐗o𝚺̂o1𝐗o)1𝐐\text{Cov}_{\mathbf{w}_o}[\text{E}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u | \mathbf{w}_o)] = \text{Cov}_{\mathbf{w}_o}(\Lambda \mathbf{w}_o - \mathbf{w}_u) = \hat{\boldsymbol{\Sigma}}_u - \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_o \hat{\boldsymbol{\Sigma}}^\top_{uo} + \mathbf{Q}(\mathbf{X}_o^\top \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1}\mathbf{Q}^\top, where 𝐐=𝐗u𝚺̂uo𝚺̂o1𝐗o\mathbf{Q} = \mathbf{X}_u - \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_o \mathbf{X}_o, the usual form for the mean squared prediction error of 𝐰̂u\hat{\mathbf{w}}_u. Next note that (viewing 𝐰u\mathbf{w}_u as a constant) E𝐰o[Cov(Λ𝐰̂o𝐰u|𝐰o)]\text{E}_{\mathbf{w}_o}[\text{Cov}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u | \mathbf{w}_o)] can be viewed as the observed Fisher Information inverse, 𝐆1- \mathbf{G}^{-1}. Evaluating 𝐆\mathbf{G} at 𝐚\mathbf{a} eventually yields the adjusted covariance matrix of Λ𝐰̂o𝐰u\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u (var_correct = TRUE) given by Cov(Λ𝐰̂o𝐰u)=𝚺̇u=Λ(𝐆𝐚)1Λ+𝚺̂u𝚺̂uo𝚺̂o1𝚺̂uo+𝐐(𝐗o𝚺̂o1𝐗o)1𝐐.\begin{equation}\label{eq:glm_blup_cov} \text{Cov}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u) = \dot{\boldsymbol{\Sigma}}_u = \Lambda (-\mathbf{G}_\mathbf{a})^{-1} \Lambda^\top + \hat{\boldsymbol{\Sigma}}_u - \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_o \hat{\boldsymbol{\Sigma}}^\top_{uo} + \mathbf{Q}(\mathbf{X}_o^\top \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1}\mathbf{Q}^\top . \end{equation}

The unadjusted covariance matrix of Λ𝐰̂o𝐰u\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u (var_correct = FALSE) can also be returned and is given by Cov(Λ𝐰̂o𝐰u)=𝚺̇u=𝚺̂u𝚺̂uo𝚺̂o1𝚺̂uo+𝐐(𝐗o𝚺̂o1𝐗o)1𝐐.\begin{equation}\label{eq:glm_blup_cov_unadj} \text{Cov}(\Lambda \hat{\mathbf{w}}_o - \mathbf{w}_u) = \dot{\boldsymbol{\Sigma}}_u = \hat{\boldsymbol{\Sigma}}_u - \hat{\boldsymbol{\Sigma}}_{uo} \hat{\boldsymbol{\Sigma}}^{-1}_o \hat{\boldsymbol{\Sigma}}^\top_{uo} + \mathbf{Q}(\mathbf{X}_o^\top \hat{\boldsymbol{\Sigma}}_o^{-1} \mathbf{X}_o)^{-1}\mathbf{Q}^\top . \end{equation}

When se.fit = TRUE, standard errors are returned on the link scale by taking the square root of the diagonal of the relevant 𝚺̇u\dot{\boldsymbol{\Sigma}}_u.

interval = "prediction"

Predictions of 𝐰u\mathbf{w}_u are returned on the link scale (type = "link") by evaluating 𝐰̇u\mathbf{\dot{w}}_u. The (100 ×\timeslevel)% prediction interval for (wu)i(w_u)_i is (ẇu)i±z*(𝚺̇u)i,i(\dot{w}_u)_i \pm z^* \sqrt{(\dot{\boldsymbol{\Sigma}}_u)_{i, i}}, where (𝚺̇u)i,i\sqrt{(\dot{\boldsymbol{\Sigma}}_u)_{i, i}} is the standard error of (ẇu)i(\dot{w}_u)_i obtained from se.fit = TRUE, Φ(z*)=1α/2\Phi(z^*) = 1 - \alpha / 2, Φ()\Phi(\cdot) is the standard normal (Gaussian) cumulative distribution function, α=1\alpha = 1 -level, and level is an argument to predict(). The default for level is 0.95, which corresponds to a z*z^* of approximately 1.96. These predictions and corresponding prediction intervals are returned on the response scale (type = "response") by applying g1()g^{-1}(\cdot) (inverse link) to (ẇu)i(\dot{w}_u)_i (the prediction), (ẇu)iz*(𝚺̇u)i,i(\dot{w}_u)_i - z^* \sqrt{(\dot{\boldsymbol{\Sigma}}_u)_{i, i}} (the prediction interval lower bound), and (ẇu)i+z*(𝚺̇u)i,i(\dot{w}_u)_i + z^* \sqrt{(\dot{\boldsymbol{\Sigma}}_u)_{i, i}} (the prediction interval upper bound). Note that the prediction intervals are symmetric on the link scale but are not generally symmetric on the response scale. One could obtain symmetric prediction intervals on the response scale using the delta method (Ver Hoef 2012).
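
As a small numeric sketch (assuming a log link and toy values for the prediction and its standard error), the back-transformation works as follows:

w_pred <- 1.2                         # prediction of w_u on the link scale
se     <- 0.4                         # sqrt of the corresponding diagonal of Sigma_dot_u
level  <- 0.95
z_star <- qnorm(1 - (1 - level) / 2)  # approximately 1.96
c(fit = exp(w_pred),
  lwr = exp(w_pred - z_star * se),
  upr = exp(w_pred + z_star * se))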

interval = "confidence"

Estimates for (𝐗u)i𝛃(\mathbf{X}_u)_i \boldsymbol{\beta} (the fixed effects portion of the model) are returned on the link scale (type = "link") by evaluating (𝐗u)i𝛃̂(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} (i.e., fitted values corresponding to (𝐗u)i)(\mathbf{X}_u)_i). The (100 ×\timeslevel)% confidence interval for (𝐗u)i𝛃(\mathbf{X}_u)_i \boldsymbol{\beta} is (𝐗u)i𝛃̂±z*(𝐗u)i[𝐁(𝐆a)1𝐁+(𝐗𝚺̂o1𝐗)1](𝐗u)i(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} \pm z^* \sqrt{(\mathbf{X}_u)_i [\mathbf{B} (-\mathbf{G}_a)^{-1} \mathbf{B}^\top + (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1}_o \mathbf{X})^{-1}](\mathbf{X}_u)_i^\top}, where (𝐗u)i(\mathbf{X}_u)_i is the iith row of 𝐗u\mathbf{X}_u, 𝐁(𝐆a)1𝐁+(𝐗𝚺̂1𝐗)1\mathbf{B} (-\mathbf{G}_a)^{-1} \mathbf{B}^\top + (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X})^{-1} is the standard error of (𝐗u)i𝛃̂(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} obtained from se.fit = TRUE, Φ(z*)=1α/2\Phi(z^*) = 1 - \alpha / 2, Φ()\Phi(\cdot) is the standard normal (Gaussian) cumulative distribution function, α=1\alpha = 1 -level, and level is an argument to predict(). The default for level is 0.95, which corresponds to a z*z^* of approximately 1.96. These estimates and corresponding confidence intervals are returned on the response scale (type = "response") by applying g1()g^{-1}(\cdot) (inverse link) to (𝐗u)i𝛃̂(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} (the estimate), (𝐗u)i𝛃̂z*(𝐗u)i[𝐁(𝐆a)1𝐁+(𝐗𝚺̂o1𝐗)1](𝐗u)i(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} - z^* \sqrt{(\mathbf{X}_u)_i [\mathbf{B} (-\mathbf{G}_a)^{-1} \mathbf{B}^\top + (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1}_o \mathbf{X})^{-1}](\mathbf{X}_u)_i^\top} (the confidence interval lower bound), and (𝐗u)i𝛃̂+z*(𝐗u)i[𝐁(𝐆a)1𝐁+(𝐗𝚺̂o1𝐗)1](𝐗u)i(\mathbf{X}_u)_i \hat{\boldsymbol{\beta}} + z^* \sqrt{(\mathbf{X}_u)_i [\mathbf{B} (-\mathbf{G}_a)^{-1} \mathbf{B}^\top + (\mathbf{X}^\top \hat{\boldsymbol{\Sigma}}^{-1}_o \mathbf{X})^{-1}](\mathbf{X}_u)_i^\top} (the confidence interval upper bound). Note that the confidence intervals are symmetric on the link scale but are generally not symmetric on the response scale. One could obtain symmetric confidence intervals on the response scale using the delta method (Ver Hoef 2012).

spgautor() extra steps

The extra step required to obtain 𝚺̂o1\hat{\boldsymbol{\Sigma}}^{-1}_o, 𝚺̂u\hat{\boldsymbol{\Sigma}}_u, and 𝚺̂uo\hat{\boldsymbol{\Sigma}}_{uo} is the same for spatial generalized autoregressive models as it is for spatial autoregressive models.

Big Data

predict() for big data spatial generalized linear models is defined similarly as for big data spatial linear models, except that \mathbf{w} are subset or predicted (instead of \mathbf{y}) to find \check{\mathbf{w}}_o (instead of \check{\mathbf{y}}_o). If standard errors are required, \check{\mathbf{G}}_o is also found.

pseudoR2()

pseudoR2() for spatial generalized linear models is defined the same as for spatial linear models.

residuals()

The residuals are obtained by applying standard generalized linear model definitions after conditioning on 𝐰\mathbf{w}.

When type = "response", response residuals are returned: 𝐞r=𝐲g1(𝐰).\begin{equation*} \mathbf{e}_{r} = \mathbf{y} - g^{-1}(\mathbf{w}). \end{equation*}

When type = "pearson", Pearson residuals are returned: 𝐞p=𝐕1/2𝐞r,\begin{equation*} \mathbf{e}_{p} = \mathbf{V}^{-1/2}\mathbf{e}_{r}, \end{equation*} where 𝐕\mathbf{V} is a diagonal matrix with iith diagonal element equal to Var(wi)\text{Var}(w_i).

When type = "deviance", deviance residuals are returned: 𝐞d=sign(𝐞r)deviancei,\begin{equation*} \mathbf{e}_{d} = \text{sign}(\mathbf{e}_r) \odot \sqrt{\text{deviance}_i}, \end{equation*} where deviancei\text{deviance}_i is the deviance of yiy_i (conditional on wiw_i) and \odot denotes the Hadmard (element-wise) product.

When type = "standardized", standardized residuals are returned: 𝐞s=𝐞d11diag(𝐇c),\begin{equation*} \mathbf{e}_{s} = \mathbf{e}_{d} \odot \frac{1}{\sqrt{1 - diag(\mathbf{H}_c)}}, \end{equation*} where diag(𝐇c)diag(\mathbf{H}_c) is the diagonal of the leverage (hat) matrix conditional on 𝐰\mathbf{w} and given by 𝐇c𝐗v(𝐗v𝐗v)1𝐗v\mathbf{H}_c \equiv \mathbf{X}_v (\mathbf{X}_v^\top\mathbf{X}_v)^{-1} \mathbf{X}_v^\top, where 𝐗v=𝐕1/2𝐗\mathbf{X}_v = \mathbf{V}^{1/2}\mathbf{X}, and \odot denotes the Hadmard (element-wise) product.

spgautor() and spglm()

Many of the details regarding spglm() and spgautor() for spatial generalized linear models are the same as splm() and spautor() for spatial linear models, though occasional differences are noted in the following subsection headers.

spgautor() Spatial Covariance Functions

Covariance functions for spgautor() are defined the same as covariance functions for spautor().

spglm() Spatial Covariance Functions

Covariance functions for spglm() are defined the same as covariance functions for splm().

Model-fitting

Recall that the likelihood of interest is \begin{equation*} \int_\mathbf{w} \exp(\ell_\mathbf{w}) d\mathbf{w} \approx [\mathbf{y} | g^{-1}(\mathbf{a}), \varphi] [\mathbf{a} | \mathbf{X}, \boldsymbol{\theta}] (2 \pi)^{n/2} |-\mathbf{G}_a|^{-1/2}, \end{equation*} and that \mathbf{a} is the value at which the gradient, \mathbf{g}, equals zero. Given \mathbf{a}, minus twice a profiled (by \boldsymbol{\beta}) marginal Laplace log-likelihood is given by \begin{equation*} -2\ell_p(\varphi, \boldsymbol{\theta}) = -2\ln([\mathbf{y} | g^{-1}(\mathbf{a}), \varphi]) -2\ln([\mathbf{a} | \mathbf{X}, \boldsymbol{\theta}]) -n \ln(2 \pi) + \ln(|- \mathbf{G}_{\mathbf{a}}|). \end{equation*} Note that \ln[\mathbf{a} | \mathbf{X}, \boldsymbol{\theta}] is the ML or REML log-likelihood from the spatial linear model (depending on the estimation method), where now \tilde{\boldsymbol{\beta}} = (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{a}.

Assuming \mathbf{y} and g^{-1}(\mathbf{w}) are conditionally independent and \varphi and \boldsymbol{\theta} are known, it can be shown that maximizing \ell_{\mathbf{w}} (with respect to \mathbf{w}) amounts to maximizing (up to a constant) \begin{equation*} \sum_{i = 1}^N \ln[y_i | w_i, \varphi] - \frac{1}{2}(\mathbf{w} - \mathbf{X} \hat{\boldsymbol{\beta}})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{w} - \mathbf{X}\hat{\boldsymbol{\beta}}). \end{equation*} Thus the gradient of \ell_{\mathbf{w}} can be shown to equal \begin{equation*} \mathbf{g} = \mathbf{d} - \boldsymbol{\Sigma}^{-1}\mathbf{w} + \boldsymbol{\Sigma}^{-1}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{d} - \mathbf{P}\mathbf{w}, \end{equation*} where d_i = \partial \ln[y_i | g^{-1}(w_i), \varphi] / \partial w_i and \mathbf{P} = \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \mathbf{X} (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\Sigma}^{-1}. The Hessian of \ell_{\mathbf{w}} is \begin{equation*} \mathbf{G} = \mathbf{D} - \mathbf{P}, \end{equation*} where D_{i, i} = \partial^2 \ln[y_i | g^{-1}(w_i), \varphi] / \partial w_i^2. Note that D_{i, j} = 0 for i \neq j because of conditional independence.

Next the Newton-Raphson algorithm can be used to maximize \ell_{\mathbf{w}}, where an update is given by \begin{equation*} \mathbf{w}^{k + 1} = \mathbf{w}^k - \alpha \mathbf{G}^{-1}\mathbf{g}, \end{equation*} where 0 < \alpha \leq 1. Typically, \alpha = 1, but it can be lowered if \mathbf{g} is unstable. Generally, the Newton-Raphson algorithm converges rapidly. The value of \mathbf{w} at convergence is defined as \mathbf{a}.
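
Below is a self-contained sketch of this Newton-Raphson algorithm for a Poisson family with a log link, using toy data; Sigma here is an illustrative covariance matrix, not one estimated by spmodel:

set.seed(3)
n <- 25
X <- cbind(1, rnorm(n))
D_mat <- as.matrix(dist(runif(n)))
Sigma <- exp(-D_mat) + diag(1e-4, n)       # toy covariance with a small nugget
y <- rpois(n, lambda = 2)
Sigma_inv <- solve(Sigma)
P <- Sigma_inv - Sigma_inv %*% X %*%
  solve(t(X) %*% Sigma_inv %*% X) %*% t(X) %*% Sigma_inv
w <- rep(0, n)                             # initial value (a = 0)
alpha <- 1
for (k in 1:50) {
  d <- y - exp(w)                          # d_i for the Poisson family
  Dw <- diag(-exp(w))                      # D, second derivatives
  g <- d - P %*% w                         # gradient
  G <- Dw - P                              # Hessian
  w_new <- as.vector(w - alpha * solve(G, g))
  converged <- max(abs(w_new - w)) < 1e-4
  w <- w_new
  if (converged) break
}
a <- w                                     # value of w at convergence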

It follows that finding \varphi and \boldsymbol{\theta} for unknown \mathbf{a} requires a doubly-iterative algorithm. First, an initial value \mathbf{a}_0 is proposed (e.g., \mathbf{a}_0 = \mathbf{0}). Then given \mathbf{a}_0, the Laplace log-likelihood is maximized, yielding \hat{\varphi}_0 and \hat{\boldsymbol{\theta}}_0. Then given \hat{\varphi}_0 and \hat{\boldsymbol{\theta}}_0, \ell_{\mathbf{w}} is maximized, yielding \mathbf{a}_1. Then given \mathbf{a}_1, the Laplace log-likelihood is maximized, yielding \hat{\varphi}_1 and \hat{\boldsymbol{\theta}}_1. Then given \hat{\varphi}_1 and \hat{\boldsymbol{\theta}}_1, \ell_{\mathbf{w}} is maximized, yielding \mathbf{a}_2. This process continues until convergence, yielding optimized values for \varphi and \boldsymbol{\theta} and, using these optimized values, a value of \mathbf{a}.

Note that the Laplace approximation incorporates a likelihood, and as a result, the only estimation methods available via the estmethod argument are "reml" (the default) and "ml". The doubly-iterative algorithm used to fit spatial generalized linear models is far more computationally expensive than fitting spatial linear models.

Optimization

Optimization for spglm() and spgautor() works as it does for splm() and spautor(), with one additional step. The convergence criterion for \mathbf{w} (within each covariance parameter iteration) is achieved when the largest absolute value of \mathbf{w}_k - \mathbf{w}_{k - 1} is less than 1/10^4 or k > 50 (k indexes the Newton-Raphson iterations).

Grid Search

The grid search for spglm() and spgautor() works as it does for splm() and spautor() except that for spglm() and spgautor(), the grid search initial values are on the link scale and the grid search sample variance is calculated by regressing ln(𝐲+1)\ln(\mathbf{y} + 1) on 𝐗\mathbf{X} instead of 𝐲\mathbf{y} on 𝐗\mathbf{X}. For negative binomial, beta, gamma, and inverse Gaussian families, the initial value of the dispersion parameter is set to one.

Hypothesis Testing

Hypothesis testing for spatial generalized linear models is defined the same as for spatial linear models. That is, the hypothesis tests are asymptotic z-tests based on the normal (Gaussian) distribution (Wald tests). The null hypothesis for the test associated with each \hat{\beta}_i is that \beta_i = 0. For spatial generalized linear models, \text{Cov}(\hat{\boldsymbol{\beta}}) requires an adjustment to account for the fact that the \mathbf{w} are not actually observed. First let \mathbf{B} = (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\Sigma}^{-1} and note that \hat{\boldsymbol{\beta}} = \mathbf{B}\mathbf{w}. Using the law of conditional variance and conditioning on \mathbf{w} as if we had observed them, it follows that \begin{equation*} \text{Cov}(\mathbf{B} \hat{\mathbf{w}}) = \text{E}_\mathbf{w}[\text{Cov}(\mathbf{B}\hat{\mathbf{w}} | \mathbf{w})] + \text{Cov}_\mathbf{w}[\text{E}(\mathbf{B}\hat{\mathbf{w}} | \mathbf{w})]. \end{equation*} We assume \hat{\mathbf{w}} is unbiased for \mathbf{w} (i.e., \text{E}(\hat{\mathbf{w}} | \mathbf{w}) = \mathbf{w}). Then \text{Cov}(\text{E}(\mathbf{B} \hat{\mathbf{w}} | \mathbf{w})) = \text{Cov}(\mathbf{B}\mathbf{w}) = \mathbf{B} \boldsymbol{\Sigma} \mathbf{B}^\top, which reduces to (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1}, the usual form for \text{Cov}(\hat{\boldsymbol{\beta}}). Next note that \text{Cov}(\hat{\mathbf{w}} | \mathbf{w}) can be viewed as the inverse of the observed Fisher information, -\mathbf{G}^{-1}, which depends on \mathbf{w} through the diagonal elements of \mathbf{D}. Evaluating -\mathbf{G} at \mathbf{a} yields the adjusted covariance matrix of \hat{\boldsymbol{\beta}} given by \begin{equation*} \text{Cov}(\hat{\boldsymbol{\beta}}) = \mathbf{B} (-\mathbf{G}_a)^{-1} \mathbf{B}^\top + (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1} . \end{equation*}
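
Continuing the Newton-Raphson sketch above (so X, Sigma_inv, P, and a are assumed to already exist in the session), the adjusted covariance matrix and the associated Wald z-tests can be sketched as:

B <- solve(t(X) %*% Sigma_inv %*% X) %*% t(X) %*% Sigma_inv
beta_hat <- as.vector(B %*% a)
G_a <- diag(-exp(a)) - P                          # Hessian evaluated at a (Poisson family)
cov_beta <- B %*% solve(-G_a) %*% t(B) +
  solve(t(X) %*% Sigma_inv %*% X)                 # adjusted covariance (var_correct = TRUE)
z <- beta_hat / sqrt(diag(cov_beta))
p_value <- 2 * (1 - pnorm(abs(z)))                # two-sided Wald test of beta_i = 0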

Random Effects (spglm() only)

Random effects for spatial generalized linear models are defined the same as for spatial linear models. Note that random effects for spatial generalized linear models are on the link scale.

Anisotropy (spglm() only)

Anisotropy for spatial generalized linear models is defined the same as for spatial linear models. Note that anisotropy parameters for spatial generalized linear models are on the link scale.

Partition Factors

Partition factors for spatial generalized linear models are defined the same as for spatial linear models.

Big Data (spglm() only)

Big data model-fitting for spatial generalized linear models is in many ways the same as for spatial linear models. The local argument behaves the same for spatial generalized linear models as it does for spatial linear models. This is because fundamentally, the “local” spatial indexing (SPIN) approach to representing 𝚺\boldsymbol{\Sigma} blockwise is still applied and serves as the basis for massive computational gains when fitting spatial generalized linear models (Ver Hoef, Dumelle, et al. 2023).

The additional step that is required to fit big data spatial generalized linear models involves efficiently manipulating the Hessian, 𝐆\mathbf{G}, to obtain its inverse and log determinant. Before providing further details, we review the Sherman-Morrison-Woodbury (SMW) formula (Sherman 1949; Sherman and Morrison 1950; Woodbury 1950). The SMW formula states that for an n×nn \times n matrix 𝐀\mathbf{A}, an n×kn \times k matrix 𝐔\mathbf{U}, a k×kk \times k matrix 𝐂\mathbf{C}, and a k×nk \times n matrix 𝐕\mathbf{V}, (𝐀+𝐔𝐂𝐕)1=𝐀1𝐀1𝐔(𝐂1+𝐕𝐀1𝐔)1𝐕𝐀1\begin{equation*} (\mathbf{A} + \mathbf{U} \mathbf{C} \mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1} \mathbf{V} \mathbf{A}^{-1} \end{equation*} and |𝐀+𝐔𝐂𝐕|=|𝐀||𝐂||𝐂1+𝐕𝐀1𝐔|.\begin{equation*} |\mathbf{A} + \mathbf{U} \mathbf{C} \mathbf{V}| = |\mathbf{A}||\mathbf{C}||\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}|. \end{equation*} The determinant result above implies ln|𝐀+𝐔𝐂𝐕|=ln|𝐀|+ln|𝐂|+ln|𝐂1+𝐕𝐀1𝐔|.\begin{equation*} \ln|\mathbf{A} + \mathbf{U} \mathbf{C} \mathbf{V}| = \ln|\mathbf{A}| + \ln|\mathbf{C}| + \ln|\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}|. \end{equation*} The SMW formula is important because if the inverse and log determinant of 𝐀\mathbf{A} is efficient to compute and k<<nk << n, then the inverse and log determinant of the desired sum can also be efficient to compute. This is because except for 𝐀1\mathbf{A}^{-1} and |𝐀|\mathbf{|A|}, the SMW formula only requires finding k×kk \times k inverses and log determinants.
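
The identities are easy to verify numerically; the following sketch uses small random matrices (purely illustrative):

set.seed(4)
n <- 6; k <- 2
A <- crossprod(matrix(rnorm(n * n), n)) + diag(n)   # n x n, positive definite
U <- matrix(rnorm(n * k), n, k)
C <- crossprod(matrix(rnorm(k * k), k)) + diag(k)   # k x k, positive definite
V <- t(U)
lhs_inv <- solve(A + U %*% C %*% V)
rhs_inv <- solve(A) - solve(A) %*% U %*%
  solve(solve(C) + V %*% solve(A) %*% U) %*% V %*% solve(A)
max(abs(lhs_inv - rhs_inv))                          # near zero
lhs_logdet <- determinant(A + U %*% C %*% V)$modulus
rhs_logdet <- determinant(A)$modulus + determinant(C)$modulus +
  determinant(solve(C) + V %*% solve(A) %*% U)$modulus
lhs_logdet - rhs_logdet                              # near zero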

Recall that the Hessian can be written as \begin{equation*} \mathbf{G} = \mathbf{D} - \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}^{-1} \mathbf{X} (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\Sigma}^{-1}. \end{equation*} Because \boldsymbol{\Sigma} can be represented blockwise, \boldsymbol{\Sigma}^{-1} can be represented blockwise and thus its inverse and log determinant can be efficiently computed. Because \mathbf{D} is diagonal, \mathbf{D} - \boldsymbol{\Sigma}^{-1} can be represented blockwise and thus its inverse and log determinant can be efficiently computed. Then the SMW formula can be used, taking \mathbf{A} = \mathbf{D} - \boldsymbol{\Sigma}^{-1}, \mathbf{U} = \boldsymbol{\Sigma}^{-1} \mathbf{X}, \mathbf{C} = (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1}, and \mathbf{V} = \mathbf{X}^\top \boldsymbol{\Sigma}^{-1}.

As previously mentioned, fitting big data spatial generalized linear models requires a doubly-iterative algorithm. This makes it far more computationally expensive than fitting big data spatial linear models.

Simulation Functions (sprpois(), sprnbinom(), sprbinom(), sprbeta(), sprgamma(), sprinvgauss())

Poisson, negative binomial, binomial, beta, gamma, and inverse Gaussian random variables can be simulated using sprpois(), sprnbinom(), sprbinom(), sprbeta(), sprgamma(), and sprinvgauss(), respectively. All of these functions work similarly. First, relevant arguments are passed to sprnorm() to simulate 𝐰\mathbf{w} on the link scale. Then using 𝐰\mathbf{w} and the dispersion parameter (when required), relevant generalized linear model random variables are simulated independently for each wiw_i. Note that the dispersion parameter is not required for sprpois() and sprbinom().
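
A base-R sketch of this two-step procedure (not spmodel’s internal implementation) for a Poisson family with a log link:

set.seed(5)
n <- 50
coords <- cbind(runif(n), runif(n))
D_mat <- as.matrix(dist(coords))
Sigma <- 2 * exp(-D_mat / 0.5) + diag(0.01, n)    # toy exponential covariance
w <- as.vector(0.5 + t(chol(Sigma)) %*% rnorm(n)) # latent w on the link (log) scale
y <- rpois(n, lambda = exp(w))                    # Poisson response given w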

vcov()

The corrected variance-covariance matrix of the fixed effects (var_correct = TRUE) is given by 𝐁(𝐆a)1𝐁+(𝐗𝚺1𝐗)1\mathbf{B} (-\mathbf{G}_a)^{-1} \mathbf{B}^\top + (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1}. The uncorrected variance-covariance matrix (var_correct = FALSE) is given by Cov(𝛃̂)=(𝐗𝚺1𝐗)1\text{Cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1}.

Distribution Parameterizations

Below we provide definitions associated with each of the six generalized linear model families available in spmodel. Note that the Poisson, binomial, gamma, and inverse Gaussian distributions are members of the exponential family, and their deviance is typically expressed as twice the difference in log-likelihoods between the saturated and fitted models times the dispersion parameter (consistent with glm()). The negative binomial and beta distributions are members of the exponential family when the dispersion parameter (\varphi) is known, and their deviance is typically expressed as just twice the difference in log-likelihoods between the saturated and fitted models (consistent with glm.nb() from MASS (Venables and Ripley 2002) for negative binomial regression and betareg() from betareg (Cribari-Neto and Zeileis 2010) for beta regression).

Poisson Distribution

The Poisson distribution is defined as f(y|μ)=μyexp(μ)y!,\begin{equation*} f(y | \mu) = \frac{\mu^y \exp(-\mu)}{y!}, \end{equation*} where yy is a non-negative integer, μ>0\mu > 0, E(y)=μ\text{E}(y) = \mu, and Var(y)=μ\text{Var}(y) = \mu.

The log-likelihood (of a single observation) is defined as lnf(y|μ)=yln(μ)μln(y!).\begin{equation*} \ln f(y | \mu) = y \ln(\mu) - \mu - \ln(y!). \end{equation*}

Using the inverse log link and writing in terms of μ=exp(w)\mu = \exp(w), the log-likelihood can be written as lnf(y|μ)=ywexp(w)ln(y!).\begin{equation*} \ln f(y | \mu) = yw - \exp(w) - \ln(y!). \end{equation*}

The derivative of the Poisson distribution with respect to ww is ddwlnf(y|μ)=yexp(w).\begin{equation*} \frac{d}{d w} \ln f(y | \mu) = y - \exp(w). \end{equation*}

The second derivative of the Poisson distribution with respect to ww is d2dw2lnf(y|μ)=exp(w).\begin{equation*} \frac{d^2}{d w^2} \ln f(y | \mu) = - \exp(w). \end{equation*}

Twice the log-likelihood of the saturated model is 2lnf(𝐲|𝛍s)=iyiln(yi)yiln(yi!).\begin{equation*} 2 \ln f(\mathbf{y} | \boldsymbol{\mu}_s) = \sum_i y_i \ln(y_i) - y_i - \ln(y_i!). \end{equation*}

Twice the log-likelihood of the fitted (observed) model is 2lnf(𝐲|𝛍̂)=iyiln(μ̂i)μ̂iln(yi!).\begin{equation*} 2 \ln f(\mathbf{y} | \hat{\boldsymbol{\mu}}) = \sum_i y_i \ln(\hat{\mu}_i) - \hat{\mu}_i - \ln(y_i!). \end{equation*}

Thus the deviance is 2iyilog(yi/μ̂i)(yiμ̂i).\begin{equation*} 2 \sum_i y_i log(y_i / \hat{\mu}_i) - (y_i - \hat{\mu}_i). \end{equation*}
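
A small sketch evaluating the Poisson quantities above (d_i, D_{i,i}, and the deviance) at toy values of y and w, with mu_hat = exp(w):

y <- c(0, 3, 1, 5)
w <- c(0.2, 1.1, 0.1, 1.4)
mu_hat <- exp(w)
d_i  <- y - mu_hat                    # first derivative with respect to w
D_ii <- -mu_hat                       # second derivative with respect to w
dev_i <- 2 * (ifelse(y == 0, 0, y * log(y / mu_hat)) - (y - mu_hat))
sum(dev_i)                            # Poisson deviance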

Negative Binomial Distribution

The negative binomial distribution is defined as f(y|μ,φ)=Γ(y+φ)Γ(φ)y!(μμ+φ)y(φμ+φ)φ,\begin{equation*} f(y | \mu, \varphi) = \frac{\Gamma(y + \varphi)}{\Gamma(\varphi) y!} \left( \frac{\mu}{\mu + \varphi} \right)^y \left( \frac{\varphi}{\mu + \varphi} \right)^\varphi, \end{equation*} where yy is a non-negative integer, μ>0\mu > 0, φ>0\varphi > 0, E(y)=μ\text{E}(y) = \mu, and Var(y)=μ+μ2φ\text{Var}(y) = \mu + \frac{\mu^2}{\varphi}.

The log-likelihood (of a single observation) is defined as lnf(y|μ,φ)=ln(Γ(y+φ))ln(Γ(φ))ln(y!)+yln(μ)yln(μ+φ)+φln(φ)φln(μ+φ).\begin{equation*} \ln f(y | \mu, \varphi) = \ln(\Gamma(y + \varphi)) - \ln(\Gamma(\varphi)) - \ln(y!) + y \ln(\mu) - y \ln(\mu + \varphi) + \varphi \ln(\varphi) - \varphi \ln(\mu + \varphi). \end{equation*}

Using the inverse log link and writing in terms of μ=exp(w)\mu = \exp(w), the log-likelihood can be written as lnf(y|μ)=ln(Γ(y+φ))ln(Γ(φ))ln(y!)+y[wln(exp(w)+φ)]+φ[ln(φ)ln(exp(w)+φ)].\begin{equation*} \ln f(y | \mu) = \ln(\Gamma(y + \varphi)) - \ln(\Gamma(\varphi)) - \ln(y!) + y [w - \ln(\exp(w) + \varphi)] + \varphi [\ln(\varphi) - \ln(\exp(w) + \varphi)]. \end{equation*}

The derivative of the negative binomial distribution with respect to ww is ddwlnf(y|μ)=y(1exp(w)exp(w)+φ)+φ(exp(w)exp(w)+φ)=yφexp(w)+φφexp(w)exp(w)+φ=φ(yexp(w))exp(w)+φ.\begin{equation*} \begin{split} \frac{d}{d w} \ln f(y | \mu) & = y\left(1 - \frac{\exp(w)}{\exp(w) + \varphi}\right) + \varphi \left(- \frac{\exp(w)}{\exp(w) + \varphi} \right) \\ & = \frac{y\varphi}{\exp(w) + \varphi} - \frac{\varphi \exp(w)}{\exp(w) + \varphi} \\ & = \frac{\varphi (y - \exp(w))}{\exp(w) + \varphi}. \end{split} \end{equation*}

The second derivative of the negative binomial distribution with respect to ww is d2dw2lnf(y|μ)=φexp(w)(y+φ)(φ+exp(w))2\begin{equation*} \frac{d^2}{d w^2} \ln f(y | \mu) = - \frac{\varphi \exp(w) (y + \varphi)}{(\varphi + \exp(w))^2} \end{equation*}

Twice the log-likelihood of the saturated model is 2lnf(𝐲|𝛍s)=iln(Γ(yi+φ))ln(Γ(φ))ln(yi!)+yiln(yi)yiln(yi+φ)+φln(φ)φln(yi+φ).\begin{equation*} 2 \ln f(\mathbf{y} | \boldsymbol{\mu}_s) = \sum_i \ln(\Gamma(y_i + \varphi)) - \ln(\Gamma(\varphi)) - \ln(y_i!) + y_i \ln(y_i) - y_i \ln(y_i + \varphi) + \varphi \ln(\varphi) - \varphi \ln(y_i + \varphi). \end{equation*}

Twice the log-likelihood of the fitted (observed) model is 2lnf(𝐲|𝛍̂)=iln(Γ(yi+φ))ln(Γ(φ))ln(yi!)+yiln(μ̂i)yiln(μ̂i+φ)+φln(φ)φln(μ̂i+φ).\begin{equation*} 2 \ln f(\mathbf{y} | \hat{\boldsymbol{\mu}}) = \sum_i \ln(\Gamma(y_i + \varphi)) - \ln(\Gamma(\varphi)) - \ln(y_i!) + y_i \ln(\hat{\mu}_i) - y_i \ln(\hat{\mu}_i + \varphi) + \varphi \ln(\varphi) - \varphi \ln(\hat{\mu}_i + \varphi). \end{equation*}

Thus the deviance is 2iyi[ln(yi)ln(yi+φ)ln(μ̂i)+ln(μ̂i+φ)]+φ[ln(yi+φ)+ln(μ̂i+φ)].\begin{equation*} 2 \sum_i y_i [\ln(y_i) - \ln(y_i + \varphi) - \ln(\hat{\mu}_i) + \ln(\hat{\mu}_i + \varphi)] + \varphi [ - \ln(y_i + \varphi) + \ln(\hat{\mu}_i + \varphi)]. \end{equation*}

Binomial Distribution

The binomial distribution is defined as f(y|μ,m)=(my)μy(1μ)my,\begin{equation*} f(y | \mu, m) = \binom{m}{y} \mu^y (1 - \mu)^{m - y}, \end{equation*} where mm is the (known) number of Bernoulli trials, yy is a non-negative integer, 0μ10 \le \mu \le 1, E(y)=mμ\text{E}(y) = m\mu, and Var(y)=mμ(1μ)\text{Var}(y) = m\mu (1 - \mu).

The log-likelihood (of a single observation) is defined as lnf(y|μ)=ln[(my)]+yln(μ)+(my)ln(1μ).\begin{equation*} \ln f(y | \mu) = \ln \left[ \binom{m}{y} \right] + y \ln(\mu) + (m - y) \ln(1 - \mu). \end{equation*}

Using the inverse logit link and writing in terms of μ=exp(w)/(1+exp(w))\mu = \exp(w) / (1 + \exp(w)), the log-likelihood can be written as lnf(y|μ)=ln[(my)]+yln(exp(w)/(1+exp(w)))+(my)ln(1exp(w)/(1+exp(w))).\begin{equation*} \ln f(y | \mu) = \ln \left[ \binom{m}{y} \right] + y \ln(\exp(w) / (1 + \exp(w))) + (m - y) \ln(1 - \exp(w) / (1 + \exp(w))). \end{equation*}

The derivative of the binomial distribution with respect to ww is ddwlnf(y|μ)=ymexp(w)1+exp(w)\begin{equation*} \frac{d}{d w} \ln f(y | \mu) = y - \frac{m \exp(w)}{1 + \exp(w)} \end{equation*}

The second derivative of the binomial distribution with respect to ww is d2dw2lnf(y|μ)=mexp(w)(1+exp(w))2.\begin{equation*} \frac{d^2}{d w^2} \ln f(y | \mu) = - \frac{m \exp(w)}{(1 + \exp(w))^2}. \end{equation*}

Twice the log-likelihood of the saturated model is 2lnf(𝐲|𝛍s)=iln[(miyi)]+yiln(yi)+(miyi)ln(miyi).\begin{equation*} 2 \ln f(\mathbf{y} | \boldsymbol{\mu}_s) = \sum_i \ln \left[ \binom{m_i}{y_i} \right] + y_i \ln(y_i) + (m_i - y_i) \ln(m_i - y_i). \end{equation*}

Twice the log-likelihood of the fitted (observed) model is 2lnf(𝐲|𝛍̂)=iln[(miyi)]+yiln(miμ̂i)+(miyi)ln(mimiμ̂i).\begin{equation*} 2 \ln f(\mathbf{y} | \hat{\boldsymbol{\mu}}) = \sum_i \ln \left[ \binom{m_i}{y_i} \right] + y_i \ln(m_i\hat{\mu}_i) + (m_i - y_i) \ln(m_i - m_i \hat{\mu}_i). \end{equation*}

Thus the deviance is 2iyi[ln(yi)ln(miμ̂i)]+(miyi)[ln(miyi)ln(mimiμ̂i)]\begin{equation*} 2 \sum_i y_i[ \ln(y_i) - \ln(m_i \hat{\mu}_i)] + (m_i - y_i)[ \ln(m_i - y_i) - \ln(m_i - m_i \hat{\mu}_i) ] \end{equation*}

Beta Distribution

The beta distribution is defined as f(y|μ,φ)=Γ(φ)Γ(μφ)Γ((1μ)φ)yμφ1(1y)(1μ)φ1,\begin{equation*} f(y | \mu, \varphi) = \frac{\Gamma(\varphi)}{\Gamma(\mu \varphi) \Gamma((1 - \mu)\varphi)} y^{\mu \varphi - 1} (1 - y)^{(1 - \mu)\varphi - 1}, \end{equation*} where 0<y<10 < y < 1, 0<μ<10 < \mu < 1, E(y)=μ\text{E}(y) = \mu, and Var(y)=μ(1μ)/(1+φ)\text{Var}(y) = \mu (1 - \mu) / (1 + \varphi).

The log-likelihood (of a single observation) is defined as lnf(y|μ)=ln(Γ(φ))ln(Γ(μφ))ln(Γ((1μ)φ))+(μφ1)ln(y)+((1μ)φ1)ln(1y).\begin{equation*} \ln f(y | \mu) = \ln(\Gamma(\varphi)) - \ln(\Gamma(\mu \varphi)) - \ln(\Gamma((1 - \mu)\varphi)) + (\mu \varphi - 1) \ln(y) + ((1 - \mu)\varphi - 1) \ln(1 - y). \end{equation*}

Using the inverse logit link and writing in terms of \mu = \exp(w) / (1 + \exp(w)), the log-likelihood can be written as \begin{equation*} \begin{split} \ln f(y | \mu, \varphi) & = \ln(\Gamma(\varphi)) - \ln(\Gamma(\frac{\exp(w)}{1 + \exp(w)} \varphi)) - \ln(\Gamma((1 - \frac{\exp(w)}{1 + \exp(w)})\varphi)) \\ & + (\frac{\exp(w)}{1 + \exp(w)} \varphi - 1) \ln(y) + ((1 - \frac{\exp(w)}{1 + \exp(w)})\varphi - 1) \ln(1 - y). \end{split} \end{equation*}

It can be shown that the derivative of beta distribution with respect to ww is ddwlnf(y|μ)=φexp(w)k0(w|y,φ)(1+exp(w))2,\begin{equation*} \frac{d}{d w} \ln f(y | \mu) = - \frac{\varphi \exp(w) k_0(w | y, \varphi)}{(1 + \exp(w))^2}, \end{equation*} where k0(w|y,φ)=ψ(0)(φexp(w)1+exp(w))ψ(0)(φ1+exp(w))+ln(1y1)k_0(w | y, \varphi) = \psi^{(0)}(\frac{\varphi \exp(w)}{1 + \exp(w)}) - \psi^{(0)}(\frac{\varphi}{1 + \exp(w)}) + \ln( \frac{1}{y} - 1) and ψ(0)\psi^{(0)} is the 0th0th derivative of the digamma function.

It can be shown that the second derivative of the beta distribution with respect to ww is d2dw2lnf(y|μ)=φexp(2w)k1(w|y,φ)(1+exp(w))4,\begin{equation*} \frac{d^2}{d w^2} \ln f(y | \mu) = - \frac{\varphi \exp(2w) k_1(w | y, \varphi)}{(1 + \exp(w))^4}, \end{equation*} where k1(w|y,φ)=φ[ψ(1)(φexp(w)1+exp(w))+ψ(1)(φ1+exp(w))]2sinh(w)[k0(w|y,φ)+2tanh1(12y)]k_1(w | y, \varphi) = \varphi [\psi^{(1)}(\frac{\varphi \exp(w)}{1 + \exp(w)}) + \psi^{(1)}( \frac{\varphi}{1 + \exp(w)})] - 2\sinh(w)[k_0(w | y, \varphi) + 2\tanh^{-1}(1 - 2y)] and ψ(n)\psi^{(n)} is the nthnth derivative of the digamma function.
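
These expressions can be evaluated directly in R, since \psi^{(0)} and \psi^{(1)} are available as digamma() and trigamma(); a sketch at toy values of y, w, and \varphi:

y <- 0.3; w <- 0.4; phi <- 5
mu <- exp(w) / (1 + exp(w))
k0 <- digamma(phi * mu) - digamma(phi / (1 + exp(w))) + log(1 / y - 1)
k1 <- phi * (trigamma(phi * mu) + trigamma(phi / (1 + exp(w)))) -
  2 * sinh(w) * (k0 + 2 * atanh(1 - 2 * y))
d_i  <- -phi * exp(w) * k0 / (1 + exp(w))^2      # first derivative
D_ii <- -phi * exp(2 * w) * k1 / (1 + exp(w))^4  # second derivative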

Twice the log-likelihood of the saturated model is 2lnf(𝐲|𝛍s)=iln(Γ(φ))ln(Γ(yiφ))ln(Γ((1yi)φ))+(yiφ1)ln(yi)+((1yi)φ1)ln(1yi).\begin{equation*} 2 \ln f(\mathbf{y} | \boldsymbol{\mu}_s) = \sum_i \ln(\Gamma(\varphi)) - \ln(\Gamma(y_i \varphi)) - \ln(\Gamma((1 - y_i)\varphi)) + (y_i \varphi - 1) \ln(y_i) + ((1 - y_i)\varphi - 1) \ln(1 - y_i). \end{equation*}

Twice the log-likelihood of the fitted (observed) model is \begin{equation*} 2 \ln f(\mathbf{y} | \hat{\boldsymbol{\mu}}) = \sum_i \ln(\Gamma(\varphi)) - \ln(\Gamma(\hat{\mu}_i \varphi)) - \ln(\Gamma((1 - \hat{\mu}_i)\varphi)) + (\hat{\mu}_i \varphi - 1) \ln(y_i) + ((1 - \hat{\mu}_i)\varphi - 1) \ln(1 - y_i). \end{equation*}

Thus the deviance is 2iln(Γ(yiφ))ln(Γ((1yi)φ))+ln(Γ(μ̂iφ))+ln(Γ((1μ̂i)φ))+(yiμ̂i)φln(yi)+(μ̂iyi)φln(1yi)\begin{equation*} 2 \sum_i - \ln(\Gamma(y_i \varphi)) - \ln(\Gamma((1 - y_i) \varphi )) + \ln(\Gamma(\hat{\mu}_i \varphi)) + \ln(\Gamma((1 - \hat{\mu}_i) \varphi )) + (y_i - \hat{\mu}_i) \varphi \ln(y_i) + (\hat{\mu}_i - y_i) \varphi \ln(1 - y_i) \end{equation*}

Sometimes the deviance contribution from the iith observation can be computationally unstable and yield a negative value (Espinheira, Ferrari, and Cribari-Neto 2008). This can happen, for example, when yiy_i is close to zero or one. When this happens, the deviance contribution is truncated to zero to reflect the fact that the theoretical deviance contribution must be non-negative.

Gamma Distribution

The gamma distribution is defined as f(y|μ,φ)=1Γ(φ)(φμ)φyφ1exp(yφμ),\begin{equation*} f(y | \mu, \varphi) = \frac{1}{\Gamma(\varphi)} \left( \frac{\varphi}{\mu} \right)^\varphi y^{\varphi - 1} \exp(\frac{-y \varphi}{\mu}), \end{equation*} where y>0y > 0, μ>0\mu > 0, E(y)=μ\text{E}(y) = \mu, and Var(y)=μ2/φ\text{Var}(y) = \mu^2/\varphi.

The log-likelihood (of a single observation) is defined as lnf(y|μ)=ln(Γ(φ))+φ[ln(φ)ln(μ)]+(φ1)ln(y)yφμ.\begin{equation*} \ln f(y | \mu) = - \ln(\Gamma(\varphi)) + \varphi [\ln(\varphi) - \ln(\mu)] + (\varphi - 1) \ln(y) - \frac{y \varphi}{\mu}. \end{equation*}

Using the inverse log link and writing in terms of \mu = \exp(w), the log-likelihood can be written as \begin{equation*} \ln f(y | \mu) = - \ln(\Gamma(\varphi)) + \varphi [\ln(\varphi) - w] + (\varphi - 1) \ln(y) - \frac{y \varphi}{\exp(w)}. \end{equation*}

The derivative of the gamma distribution with respect to ww is ddwlnf(y|μ)=φ+φyexp(w).\begin{equation*} \frac{d}{d w} \ln f(y | \mu) = -\varphi + \frac{\varphi y}{\exp(w)}. \end{equation*}

The second derivative of the gamma distribution with respect to ww is d2dw2lnf(y|μ)=φyexp(w).\begin{equation*} \frac{d^2}{d w^2} \ln f(y | \mu) = - \frac{\varphi y}{\exp(w)}. \end{equation*}

Twice the log-likelihood of the saturated model is \begin{equation*} 2 \ln f(\mathbf{y} | \boldsymbol{\mu}_s) = \sum_i - \ln(\Gamma(\varphi)) + \varphi [\ln(\varphi) - \ln(y_i)] + (\varphi - 1) \ln(y_i) - \frac{y_i \varphi}{y_i}. \end{equation*}

Twice the log-likelihood of the fitted (observed) model is \begin{equation*} 2 \ln f(\mathbf{y} | \hat{\boldsymbol{\mu}}) = \sum_i - \ln(\Gamma(\varphi)) + \varphi [\ln(\varphi) - \ln(\hat{\mu}_i)] + (\varphi - 1) \ln(y_i) - \frac{y_i \varphi}{\hat{\mu}_i}. \end{equation*}

Thus the deviance is \begin{equation*} 2 \sum_i - \ln(\frac{y_i}{\hat{\mu}_i}) + \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i} \end{equation*} after scaling by \varphi.

Inverse Gaussian Distribution

The inverse Gaussian distribution is defined as \begin{equation*} f(y | \mu, \varphi) = \sqrt{\frac{\varphi \mu}{2 \pi y^3}} \exp \left( - \frac{\varphi (y - \mu)^2}{2 \mu y} \right), \end{equation*} where y > 0, \mu > 0, \text{E}(y) = \mu, and \text{Var}(y) = \mu^2/\varphi.

The log-likelihood (of a single observation) is defined as lnf(y|μ)=12[ln(φ/2πy3)+ln(μ)]φ(yμ)22μy.\begin{equation*} \ln f(y | \mu) = \frac{1}{2}[\ln(\varphi / 2 \pi y^3) + \ln(\mu)] - \varphi \frac{(y - \mu)^2}{2 \mu y}. \end{equation*}

Using the inverse log link and writing in terms of μ=exp(w)\mu = \exp(w), the log-likelihood can be written as lnf(y|μ)=12[ln(φ/2πy3)+w]φ(yexp(w))22exp(w)y.\begin{equation*} \ln f(y | \mu) = \frac{1}{2}[\ln(\varphi / 2 \pi y^3) + w] - \varphi \frac{(y - \exp(w))^2}{2 \exp(w) y}. \end{equation*}

The derivative of the inverse Gaussian distribution with respect to w is \begin{equation*} \frac{d}{d w} \ln f(y | \mu) = \varphi \left( \frac{y}{2 \exp(w)} - \frac{\exp(w)}{2y} \right) + \frac{1}{2}. \end{equation*}

The second derivative of the inverse Gaussian distribution with respect to w is \begin{equation*} \frac{d^2}{d w^2} \ln f(y | \mu) = - \frac{\varphi(\exp(2w) + y^2)}{2y\exp(w)}. \end{equation*}

Note that this is not a typical parameterization of the inverse Gaussian distribution. The typical parameterization of the inverse Gaussian distribution is \begin{equation*} f(y | \mu, \lambda) = \sqrt{\frac{\lambda}{2 \pi y^3}} \exp \left( - \frac{\lambda(y - \mu)^2}{2\mu^2 y} \right), \end{equation*} where y > 0, \mu > 0, \text{E}(y) = \mu, \text{Var}(y) = \mu^3/\lambda, and \lambda = \mu \varphi.

Twice the log-likelihood of the saturated model is \begin{equation*} 2 \ln f(\mathbf{y} | \boldsymbol{\mu}_s) = \sum_i \frac{1}{2} \ln \left( \frac{\lambda}{2 \pi y_i^3} \right). \end{equation*}

Twice the log-likelihood of the fitted (observed) model is \begin{equation*} 2 \ln f(\mathbf{y} | \hat{\boldsymbol{\mu}}) = \sum_i \frac{1}{2} \ln \left( \frac{\lambda}{2 \pi y_i^3} \right) - \frac{\lambda (y_i - \hat{\mu}_i)^2}{2 \hat{\mu}_i^2 y_i}. \end{equation*}

Thus the deviance is i(yiμ̂i)2/(μ̂i2yi),\begin{equation*} \sum_i (y_i - \hat{\mu}_i)^2 / (\hat{\mu}_i^2 y_i), \end{equation*} after scaling by λ\lambda.

Table of Inverse Link Functions, did_i, and Di,iD_{i, i}

The following table contains the inverse link functions, d_i, and D_{i, i} for each spatial generalized linear model family. See more details for each family in the previous subsections.

A table of inverse link functions and relevant quantities for each spatial generalized linear model family.
Family \mu = g^{-1}(w) d_i D_{i,i}
Poisson μ=exp(w)\mu = \exp(w) yiexp(wi)y_i - \exp(w_i) exp(wi)- \exp(w_i)
Negative Binomial μ=exp(w)\mu = \exp(w) φ(yiexp(wi))exp(wi)+φ\frac{\varphi (y_i - \exp(w_i))}{\exp(w_i) + \varphi} φexp(wi)(yi+φ)(φ+exp(wi))2- \frac{\varphi \exp(w_i) (y_i + \varphi)}{(\varphi + \exp(w_i))^2}
Binomial μ=exp(w)1+exp(w)\mu = \frac{\exp(w)}{1 + \exp(w)} yimiexp(wi)1+exp(wi)y_i - \frac{m_i \exp(w_i)}{1 + \exp(w_i)} miexp(wi)(1+exp(wi))2- \frac{m_i \exp(w_i)}{(1 + \exp(w_i))^2}
Beta μ=exp(w)1+exp(w)\mu = \frac{\exp(w)}{1 + \exp(w)} φexp(wi)k0(wi|yi,φ)(1+exp(wi))2- \frac{\varphi \exp(w_i) k_0(w_i | y_i, \varphi)}{(1 + \exp(w_i))^2} φexp(2wi)k1(wi|yi,φ)(1+exp(wi))4- \frac{\varphi \exp(2w_i) k_1(w_i | y_i, \varphi)}{(1 + \exp(w_i))^4}
Gamma μ=exp(w)\mu = \exp(w) φ+φyiexp(wi)-\varphi + \frac{\varphi y_i}{\exp(w_i)} φyiexp(wi)- \frac{\varphi y_i}{\exp(w_i)}
Inverse Gaussian μ=exp(w)\mu = \exp(w) φ(yi2exp(wi)exp(wi)2yi)+12\varphi \left( \frac{y_i}{2 \exp(w_i)} - \frac{\exp(w_i)}{2y_i} \right) + \frac{1}{2} φ(exp(2wi)+yi2)2yiexp(wi)- \frac{\varphi(\exp(2w_i) + y_i^2)}{2y_i\exp(w_i)}

The Empirical Semivariogram (esv())

The empirical semivariogram is a moment-based estimate of the theoretical semivariogram. The empirical semivariogram quantifies half of the average squared difference in the response among observations in several distance classes. More formally, the empirical semivariogram is defined as γ̂(h)=12|N(h)|N(h)(yiyj)2,\begin{equation}\label{eq:esv} \hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{N(h)} (y_i - y_j)^2, \end{equation} where N(h)N(h) is the set of observations in 𝐲\mathbf{y} that are hh distance apart (distance classes) and |N(h)||N(h)| is the cardinality of N(h)N(h)(Cressie 1993). Often the set N(h)N(h) contains observations that are h±ch \pm c apart, where cc is some constant. This approach is known as “binning” the empirical semivariogram. The default in spmodel is to construct the semivariogram using 15 equally spaced bins where hh is contained in (0,hmax](0, h_{max}], and hmaxh_{max} is known as a “distance cutoff”. Distance cutoffs are commonly used when constructing γ̂(h)\hat{\gamma}(h) because there tend to be few pairs with large distances. The default in spmodel is to use a cutoff of half the maximum distance (hypotenuse) of the domain’s bounding box.
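
A base-R sketch of this binned computation follows (esv() performs this internally in spmodel with its own defaults; the cutoff below is a simplified stand-in for the bounding-box-based default):

set.seed(6)
n <- 100
coords <- cbind(runif(n), runif(n))
y <- rnorm(n)
D_mat <- as.matrix(dist(coords))
d_ij <- D_mat[lower.tri(D_mat)]                   # pairwise distances
sq_diff <- (outer(y, y, "-")^2)[lower.tri(D_mat)] # squared response differences
cutoff <- max(d_ij) / 2                           # simplified distance cutoff
keep <- d_ij <= cutoff
bins <- cut(d_ij[keep], breaks = seq(0, cutoff, length.out = 16))  # 15 distance classes
gamma_hat <- tapply(sq_diff[keep], bins, function(x) mean(x) / 2)
gamma_hat                                         # semivariance by distance class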

The main purpose of the empirical semivariogram is its use in semivariogram weighted least squares estimation for spatial linear models, though it can also be used as a visual diagnostic to assess the fit of a spatial covariance function.

A Note on Covariance Square Roots and Inverse Products

Often \boldsymbol{\Sigma}^{-1} is not strictly needed for estimation, prediction, or other purposes, but at least the product between \boldsymbol{\Sigma}^{-1} and some other matrix is needed. Consider the example of the covariance matrix of \hat{\boldsymbol{\beta}} and observe that \mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X} is needed. The most direct way to find this product is certainly to obtain \boldsymbol{\Sigma}^{-1} and then multiply by \mathbf{X}^\top on the left and \mathbf{X} on the right. However, this is computationally expensive, and it cannot be used to compute products that involve \boldsymbol{\Sigma}^{-1/2}, which are often useful. It is helpful to define \boldsymbol{\Sigma} = \mathbf{S} \mathbf{S}^\top for some matrix \mathbf{S} and rewrite \mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X} as \mathbf{X}^\top (\mathbf{S}^\top)^{-1} \mathbf{S}^{-1} \mathbf{X} = (\mathbf{S}^{-1} \mathbf{X})^\top \mathbf{S}^{-1} \mathbf{X}. Then one computes the inverse products by finding \mathbf{S}.

One way to find 𝐒\mathbf{S} is to use an eigendecomposition. The eigendecomposition of 𝚺\boldsymbol{\Sigma} (which is real and symmetric) is given by 𝚺=𝐔𝐃𝐔,\begin{equation*} \boldsymbol{\Sigma} = \mathbf{U} \mathbf{D} \mathbf{U}^\top, \end{equation*} where 𝐔\mathbf{U} is an orthogonal matrix of eigenvectors of 𝚺\boldsymbol{\Sigma} and 𝐃\mathbf{D} is a diagonal matrix with eigenvalues of 𝚺\boldsymbol{\Sigma} on the diagonal. Then taking 𝐒=𝐔𝐃1/2\mathbf{S} = \mathbf{U}\mathbf{D}^{1/2} implies 𝐒1=𝐃1/2𝐔\mathbf{S}^{-1} = \mathbf{D}^{-1/2} \mathbf{U}^\top, which follows because 𝐔\mathbf{U} is orthonormal (𝐔1=𝐔\mathbf{U}^{-1} = \mathbf{U}^\top) and is straightforward to calculate as 𝐃1/2\mathbf{D}^{1/2} is diagonal. Also notice that 𝚺1/2=𝐔𝐃1/2𝐔\boldsymbol{\Sigma}^{1/2} = \mathbf{U} \mathbf{D}^{1/2} \mathbf{U}^\top, where 𝐃1/2\mathbf{D}^{1/2} is a diagonal matrix with square roots of eigenvalues of 𝚺\boldsymbol{\Sigma} on the diagonal. This result follows because 𝚺1/2𝚺1/2=𝐔𝐃1/2𝐔𝐔𝐃1/2𝐔=𝐔𝐃1/2(𝐔𝐔)𝐃1/2𝐔=𝐔𝐃𝐔=𝚺.\begin{equation*} \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2} = \mathbf{U} \mathbf{D}^{1/2} \mathbf{U}^\top \mathbf{U} \mathbf{D}^{1/2} \mathbf{U}^\top = \mathbf{U} \mathbf{D}^{1/2} (\mathbf{U}^\top \mathbf{U}) \mathbf{D}^{1/2} \mathbf{U}^\top = \mathbf{U} \mathbf{D} \mathbf{U}^\top = \boldsymbol{\Sigma}. \end{equation*} So not only does the eigendecomposition approach give us the inverse products, it also gives us 𝚺1/2\boldsymbol{\Sigma}^{1/2} and 𝚺1/2\boldsymbol{\Sigma}^{-1/2}. While straightforward, this approach is less efficient than the Cholesky decomposition (Golub and Van Loan 2013), which we discuss next.

The Cholesky decomposition decomposes 𝚺\boldsymbol{\Sigma} into the product between 𝐂\mathbf{C} and 𝐂\mathbf{C}^\top (𝚺=𝐂𝐂\boldsymbol{\Sigma} = \mathbf{C}\mathbf{C}^\top), where 𝐂\mathbf{C} is a lower triangular matrix. Note that 𝐂\mathbf{C} is generally not equal to 𝚺1/2\boldsymbol{\Sigma}^{1/2}. Taking 𝐒\mathbf{S} to be 𝐂\mathbf{C}, we see that finding the inverse products requires solving 𝐂1𝐗\mathbf{C}^{-1}\mathbf{X}. Observe that 𝐂1𝐗=𝐀\mathbf{C}^{-1}\mathbf{X} = \mathbf{A} for some matrix 𝐀\mathbf{A}. This implies 𝐗=𝐂𝐀\mathbf{X} = \mathbf{C}\mathbf{A}, which for 𝐀\mathbf{A} can be efficiently solved using forward substitution because 𝐂\mathbf{C} is lower triangular.
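
The following sketch illustrates both decompositions on a small toy covariance matrix (Sigma and X here are illustrative):

set.seed(7)
n <- 5
Sigma <- exp(-as.matrix(dist(runif(n)))) + diag(0.1, n)
X <- cbind(1, rnorm(n))

# Eigendecomposition: Sigma^{1/2} and Sigma^{-1/2}
e <- eigen(Sigma, symmetric = TRUE)
Sigma_half     <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
Sigma_neg_half <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
max(abs(Sigma_half %*% Sigma_half - Sigma))  # near zero

# Cholesky: Sigma = C C^T with C lower triangular; inverse products via forwardsolve()
C <- t(chol(Sigma))                          # chol() returns the upper factor
A <- forwardsolve(C, X)                      # A = C^{-1} X
crossprod(A)                                 # equals t(X) %*% solve(Sigma) %*% X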

The products in this document that involve 𝚺1/2\boldsymbol{\Sigma}^{1/2} and 𝚺1\boldsymbol{\Sigma}^{-1} are generally implemented in spmodel using 𝐂\mathbf{C} and 𝐂1\mathbf{C}^{-1}. The products in this document that involve 𝚺1/2\boldsymbol{\Sigma}^{-1/2} still rely on an eigendecomposition (because recall that generally, 𝐂1𝐀𝚺1/2𝐀\mathbf{C}^{-1} \mathbf{A} \neq \boldsymbol{\Sigma}^{-1/2} \mathbf{A}). An example is computing the Pearson residuals.

A Note on Computational Stability

Spatial covariance matrices that have approximately no independent error variance (σie2\sigma^2_{ie}) can have unstable inverses. When this occurs, a small value can be added to the diagonal of the covariance matrix (via updating σie2\sigma^2_{ie}) to impose some computational stability. In spmodel, if σie2\sigma^2_{ie} is approximately zero, a small amount is added to the diagonal of the covariance matrix. Specifically, for spatial linear models, σie,up2=max(σie2,σde2/104)\sigma^2_{ie, up} = \text{max}(\sigma^2_{ie}, \sigma^2_{de}/10^4), where σie,up2\sigma^2_{ie, up} denotes an “updated” version of σie2\sigma^2_{ie}. For spatial generalized linear models, σie,up2=max(σie2,σde2/104,d)\sigma^2_{ie, up} = \text{max}(\sigma^2_{ie}, \sigma^2_{de}/10^4, d), where d=max(1/104,s2/104)d = \text{max}(1/10^4, s^2/10^4), where s2s^2 is the sample variance of a linear regression of ln(𝐲+1)\ln(\mathbf{y} + 1) on 𝐗\mathbf{X}. This value of σie,up2\sigma^2_{ie, up} is also added to the diagonal of 𝐗𝚺1𝐗+𝐗𝚺1(𝐃𝚺1)1𝚺1𝐗\mathbf{X}^\top \boldsymbol{\Sigma}^{-1} \mathbf{X} + \mathbf{X}^\top \boldsymbol{\Sigma}^{-1} (\mathbf{D} - \boldsymbol{\Sigma}^{-1})^{-1} \boldsymbol{\Sigma}^{-1} \mathbf{X}, used via the Sherman-Morrison-Woodbury formula required to efficiently find the log determinant and inverse of the Hessian, 𝐆\mathbf{G}, when using spatial indexing for big data. For more on stability of spatial covariance matrices, see Diamond and Armstrong (1984), Posa (1989), O’Dowd (1991), Ababou, Bagtzoglou, and Wood (1994), Booker et al. (1999), Martin and Simpson (2005), Bivand, Pebesma, and Gomez-Rubio (2013), and Ver Hoef (2018), among others.

References

Ababou, Rachid, Amvrossios C Bagtzoglou, and Eric F Wood. 1994. “On the Condition Number of Covariance Matrices in Kriging, Estimation, and Simulation of Random Fields.” Mathematical Geology 26: 99–133.
Anselin, Luc, Ibnu Syabri, and Youngihn Kho. 2010. “GeoDa: An Introduction to Spatial Data Analysis.” In Handbook of Applied Spatial Analysis, 73–89. Springer.
Bivand, Roger S., Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. https://asdar-book.org/.
Booker, Andrew J, John E Dennis, Paul D Frank, David B Serafini, Virginia Torczon, and Michael W Trosset. 1999. “A Rigorous Framework for Optimization of Expensive Functions by Surrogates.” Structural Optimization 17: 1–13.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Brent, Richard P. 1971. “An Algorithm with Guaranteed Convergence for Finding a Zero of a Function.” The Computer Journal 14 (4): 422–25.
Cook, R Dennis. 1979. “Influential Observations in Linear Regression.” Journal of the American Statistical Association 74 (365): 169–74.
Cook, R Dennis, and Sanford Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman; Hall.
Cressie, Noel. 1985. “Fitting Variogram Models by Weighted Least Squares.” Journal of the International Association for Mathematical Geology 17 (5): 563–86.
———. 1993. Statistics for Spatial Data. John Wiley & Sons.
Cribari-Neto, Francisco, and Achim Zeileis. 2010. “Beta Regression in R.” Journal of Statistical Software 34 (2): 1–24. https://doi.org/10.18637/jss.v034.i02.
Curriero, Frank C, and Subhash Lele. 1999. “A Composite Likelihood Approach to Semivariogram Estimation.” Journal of Agricultural, Biological, and Environmental Statistics, 9–28.
Diamond, Phil, and Margaret Armstrong. 1984. “Robustness of Variograms and Conditioning of Kriging Matrices.” Journal of the International Association for Mathematical Geology 16: 809–22.
Espinheira, Patrícia L, Silvia LP Ferrari, and Francisco Cribari-Neto. 2008. “On Beta Regression Residuals.” Journal of Applied Statistics 35 (4): 407–19.
Fawcett, Tom. 2006. “An Introduction to ROC Analysis.” Pattern Recognition Letters 27 (8): 861–74.
Fox, Eric W, Jay M Ver Hoef, and Anthony R Olsen. 2020. “Comparing Spatial Regression to Random Forests for Large Environmental Data Sets.” PloS One 15 (3): e0229509.
Goldman, Nick, and Simon Whelan. 2000. “Statistical Tests of Gamma-Distributed Rate Heterogeneity in Models of Sequence Evolution in Phylogenetics.” Molecular Biology and Evolution 17 (6): 975–78.
Golub, Gene H, and Charles F Van Loan. 2013. Matrix Computations. JHU press.
Hanley, James A, and Barbara J McNeil. 1982. “The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve.” Radiology 143 (1): 29–36.
Harville, David A. 1977. “Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems.” Journal of the American Statistical Association 72 (358): 320–38.
Harville, David A, and Daniel R Jeske. 1992. “Mean Squared Error of Estimation or Prediction Under a General Linear Model.” Journal of the American Statistical Association 87 (419): 724–31.
Henderson, Charles R. 1975. “Best Linear Unbiased Estimation and Prediction Under a Selection Model.” Biometrics, 423–47.
Hoeting, Jennifer A, Richard A Davis, Andrew A Merton, and Sandra E Thompson. 2006. “Model Selection for Geostatistical Models.” Ecological Applications 16 (1): 87–98.
Hrong-Tai Fai, Alex, and Paul L Cornelius. 1996. “Approximate f-Tests of Multiple Degree of Freedom Hypotheses in Generalized Least Squares Analyses of Unbalanced Split-Plot Experiments.” Journal of Statistical Computation and Simulation 54 (4): 363–78.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Kackar, Raghu N, and David A Harville. 1984. “Approximations for Standard Errors of Estimators of Fixed and Random Effects in Mixed Linear Models.” Journal of the American Statistical Association 79 (388): 853–62.
Kenward, Michael G, and James H Roger. 1997. “Small Sample Inference for Fixed Effects from Restricted Maximum Likelihood.” Biometrics, 983–97.
———. 2009. “An Improved Approximation to the Precision of Fixed Effects from Restricted Maximum Likelihood.” Computational Statistics & Data Analysis 53 (7): 2583–95.
Littell, Ramon C, George A Milliken, Walter W Stroup, Russell D Wolfinger, and Oliver Schabenberger. 2006. SAS for Mixed Models. SAS Publishing.
MacQueen, J. 1967. “Classification and Analysis of Multivariate Observations.” In 5th Berkeley Symp. Math. Statist. Probability, 281–97. University of California Los Angeles LA USA.
Martin, Jay D, and Timothy W Simpson. 2005. “Use of Kriging Models to Approximate Deterministic Computer Models.” AIAA Journal 43 (4): 853–63.
Meinshausen, Nicolai, and Greg Ridgeway. 2006. “Quantile Regression Forests.” Journal of Machine Learning Research 7 (6).
Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to Linear Regression Analysis. John Wiley & Sons.
Muschelli III, John. 2020. “ROC and AUC with a Binary Predictor: A Potentially Misleading Metric.” Journal of Classification 37 (3): 696–708.
Myers, Raymond H, Douglas C Montgomery, G Geoffrey Vining, and Timothy J Robinson. 2012. Generalized Linear Models: With Applications in Engineering and the Sciences. John Wiley & Sons.
Nelder, John A, and Roger Mead. 1965. “A Simplex Method for Function Minimization.” The Computer Journal 7 (4): 308–13.
O’Dowd, RJ. 1991. “Conditioning of Coefficient Matrices of Ordinary Kriging.” Mathematical Geology 23: 721–39.
Patterson, Desmond, and Robin Thompson. 1971. “Recovery of Inter-Block Information When Block Sizes Are Unequal.” Biometrika 58 (3): 545–54.
Pinheiro, José, and Douglas Bates. 2006. Mixed-Effects Models in S and S-PLUS. Springer Science & Business Media.
Posa, D. 1989. “Conditioning of the Stationary Kriging Matrices for Some Well-Known Covariance Models.” Mathematical Geology 21: 755–65.
Prasad, NG Narasimha, and Jon NK Rao. 1990. “The Estimation of the Mean Squared Error of Small-Area Estimators.” Journal of the American Statistical Association 85 (409): 163–71.
Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77.
Satterthwaite, Franklin E. 1946. “An Approximate Distribution of Estimates of Variance Components.” Biometrics Bulletin 2 (6): 110–14.
Schluchter, Mark D, and Janet T Elashoff. 1990. “Small-Sample Adjustments to Tests with Unbalanced Repeated Measures Assuming Several Covariance Structures.” Journal of Statistical Computation and Simulation 37 (1-2): 69–87.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics, 461–64.
Searle, Shayle R, George Casella, and Charles E McCulloch. 2009. Variance Components. John Wiley & Sons.
Self, Steven G, and Kung-Yee Liang. 1987. “Asymptotic Properties of Maximum Likelihood Estimators and Likelihood Ratio Tests Under Nonstandard Conditions.” Journal of the American Statistical Association 82 (398): 605–10.
Sherman, Jack. 1949. “Adjustment of an Inverse Matrix Corresponding to Changes in the Elements of a Given Column or a Given Row of the Original Matrix.” Annals of Mathematical Statistics 20 (4): 621.
Sherman, Jack, and Winifred J Morrison. 1950. “Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix.” The Annals of Mathematical Statistics 21 (1): 124–27.
Stram, Daniel O, and Jae Won Lee. 1994. “Variance Components Testing in the Longitudinal Mixed Effects Model.” Biometrics, 1171–77.
Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (sup1): 234–40.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth Edition. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4/.
Ver Hoef, Jay M. 2012. “Who Invented the Delta Method?” The American Statistician 66 (2): 124–27.
———. 2018. “Kriging Models for Linear Networks and Non-Euclidean Distances: Cautions and Solutions.” Methods in Ecology and Evolution 9 (6): 1600–1613.
Ver Hoef, Jay M, Eryn Blagg, Michael Dumelle, Philip M Dixon, Dale L Zimmerman, and Paul Conn. 2023. “Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation.” arXiv Preprint arXiv:2305.02978.
Ver Hoef, Jay M, Michael Dumelle, Matt Higham, Erin E Peterson, and Daniel J Isaak. 2023. “Indexing and Partitioning the Spatial Linear Model for Large Data Sets.” arXiv Preprint arXiv:2305.07811.
Ver Hoef, Jay M, Erin E Peterson, Mevin B Hooten, Ephraim M Hanks, and Marie-Josèe Fortin. 2018. “Spatial Autoregressive Models for Statistical Inference from Ecological Data.” Ecological Monographs 88 (1): 36–59.
Wolf, Helmut. 1978. “The Helmert Block Method-Its Origin and Development.” In Proceedings of the Second International Symposium on Problems Related to the Redefinition of North American Geodetic Networks,(NOAA, Arlington-Va, 1978), 319–26.
Wolfinger, Russ, Randy Tobias, and John Sall. 1994. “Computing Gaussian Likelihoods and Their Derivatives for General Linear Mixed Models.” SIAM Journal on Scientific Computing 15 (6): 1294–1310.
Woodbury, Max A. 1950. Inverting Modified Matrices. Department of Statistics, Princeton University.
Wright, Marvin N, and Andreas Ziegler. 2015. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” arXiv Preprint arXiv:1508.04409.
Zimmerman, Dale L, and Jay M Ver Hoef. 2024. Spatial Linear Models for Environmental Data. CRC Press.