vignettes/sampling.Rmd
sampling.Rmd
If you have yet not read the “Start Here” vignette, please do so by running
vignette("start-here", "spsurvey")
The Generalized Random Tessellation Stratified (GRTS) algorithm
(Stevens and Olsen, 2004, Olsen et.al., 2012) is a spatially balanced
sampling algorithm available in spsurvey. The GRTS algorithm is used to
sample from finite populations (point resources – e.g. lakes treated as
a whole) and infinite populations (linear resources – e.g. rivers and
streams; areal resources – e.g. wetlands and coastlines) and is
implemented in spsurvey using the grts()
function. The
output from grts()
contains the design sites and additional
information about the sampling design. More specifically, it is a list
with five elements:
sites_legacy
: legacy sites included in the base (main)
sample (see Legacy sampling)sites_base
: sites included in the base (main) sample
that are not legacy sitessites_over
: replacement sites obtained using reverse
hierarchical ordering (see Reverse hierarchical
ordering)sites_near
: replacement sites obtained using nearest
neighbor (see Nearest neighbor)design
: a list detailing the sampling designWe use the NE_Lakes
sampling frame in spsurvey to
introduce spatially balanced sampling via the GRTS algorithm.
NE_Lakes
is a finite sampling frame and contains lakes
(treated as a whole) from the Northeastern United States. There are a
few variables in NE_Lakes
you will throughout this
vignette:
AREA
: lake area in hectaresAREA_CAT
: lake area categories (small and large)ELEV_CAT
: lake elevation categories (low and high)Before proceeding, we load spsurvey and set a reproducible seed by running
To select an unstratified GRTS sample where each site in the sampling frame has an equal probability of selection (i.e. inclusion probability), run
eqprob <- grts(NE_Lakes, n_base = 50)
The first argument to grts()
is the sampling frame,
which must be an sf
object. The second argument is
n_base
, which specifies the number of sites in the base
(main) sample. The sites_base
object in eqprob
is an sf
object and contains the original columns of
NE_Lakes
as well as a few additional columns such as a site
identifier, latitude and longitude coordinates, inclusion probabilities,
and design weights (to be used in analyses after collecting data).
To print a summary of site counts, run
eqprob
#> Summary of Site Counts:
#>
#> total siteuse
#> total:50 Base:50
To visualize the design sites overlain onto the sampling frame, run
The key.width
argument extends the plot’s margin to fit
the legend text nicely within the plot. sp_plot()
can
equivalently be used in place of plot()
(sp_plot()
is currently maintained for backwards
compatibility with previous spsurvey versions).
To select an unstratified GRTS sample where each site in the sampling frame has unequal inclusion probabilities according to some categorical variable, run
caty_n <- c(small = 40, large = 10)
uneqprob <- grts(
NE_Lakes,
n_base = 50,
caty_var = "AREA_CAT",
caty_n = caty_n
)
uneqprob
#> Summary of Site Counts:
#>
#> siteuse by total:
#> Base
#> total 50
#>
#> siteuse by caty:
#> Base
#> large 10
#> small 40
caty_var
is the unequal inclusion probability variable
("AREA_CAT"
in NE_Lakes
), and
caty_n
is a vector whose names are the levels in
caty_var
and whose values are the expected sample sizes in
each category (the sum of these expected samples sizes must equal
n_base
). In this sample, inclusion probabilities are
adjusted so that on average, there are 40 small lakes and 10 large lakes
selected.
To select an unstratified GRTS sample where each site in the sampling frame has inclusion probability proportional to a positive, continuous variable, run
propprob <- grts(
NE_Lakes,
n_base = 50,
aux_var = "AREA"
)
aux_var
is the proportional probability (auxiliary)
variable ("AREA"
in NE_Lakes
). Proportional
(to size) inclusion probabilities are useful because they can increase
the precision of estimates when the response variable is positively
correlated with the proportional probability variable.
Stratified sampling designs partition the sampling frame into distinct groups called strata and sites are selected within each stratum independently of other strata. Stratified sampling is useful from a practical perspective because it allows for stratum-specific sample sizes and implementation practices (e.g. each stratum may have different sampling protocols). Stratified sampling is useful from a statistical perspective because estimates from stratified samples tend to be more precise than estimates from unstratified samples.
To select a sample stratified by elevation category with equal inclusion probabilities in each stratum, run
strata_n <- c(low = 25, high = 15)
strat_eqprob <- grts(NE_Lakes, n_base = strata_n, stratum_var = "ELEV_CAT")
strat_eqprob
#> Summary of Site Counts:
#>
#> siteuse by total:
#> Base
#> total 40
#>
#> siteuse by stratum:
#> Base
#> high 15
#> low 25
strata_n
is a named vector whose names represent the
strata and whose values represent strata-specific sample sizes, and
stratum_var
is the stratification variable
(ELEV_CAT
in NE_Lakes
).
To visualize the design sites overlain onto the sampling frame (separately for each stratum), run
To select a sample stratified by elevation category with unequal inclusion probabilities for each area category, run
caty_n <- list(
low = c(small = 20, large = 5),
high = c(small = 10, large = 5)
)
strat_uneqprob <- grts(
NE_Lakes,
n_base = strata_n,
stratum_var = "ELEV_CAT",
caty_var = "AREA_CAT",
caty_n = caty_n
)
strat_uneqprob
#> Summary of Site Counts:
#>
#> siteuse by total:
#> Base
#> total 40
#>
#> siteuse by stratum:
#> Base
#> high 15
#> low 25
#>
#> siteuse by caty:
#> Base
#> large 9
#> small 31
#>
#> siteuse by stratum:caty:
#> Base
#> high:large 3
#> low:large 6
#> high:small 12
#> low:small 19
caty_n
is now a list: the first element contains the
expected sample sizes for each area category in the low elevation
stratum, and the second element contains the expected sample sizes for
each area category in the high elevation stratum.
To select a sample stratified by elevation category with probabilities proportional to lake area, run
strat_propprob <- grts(
NE_Lakes,
n_base = strata_n,
stratum_var = "ELEV_CAT",
aux_var = "AREA"
)
Legacy (historical) sites are sites that were selected from a
previous sampling design that incorporated randomness into site
selection, are part of the sampling frame for the current sampling
design, and should be always be selected in the current sample. The
NE_Lakes_Legacy
data contains some legacy sites. To
accommodate these legacy sites while sampling, use the
legacy_sites
argument:
legacy <- grts(NE_Lakes, n_base = 50, legacy_sites = NE_Lakes_Legacy)
legacy
#> Summary of Site Counts:
#>
#> total siteuse
#> total:50 Legacy: 5
#> Base :45
To visualize the legacy and base sites together, run
These points can be overlain onto the sampling frame by including
NE_Lakes
in the plot()
command.
Legacy sites are included in the base (main) sample, so the value for
n_base
should be equal to the number of legacy sites plus
the number of non-legacy sites desired in the sample. If your sampling
design uses stratification, unequal inclusion probabilities, or
proportional inclusion probabilities, you need to name the column in
legacy_sites
that represents these values. By default,
grts()
assumes that the respective columns in
sframe
and legacy_sites
share the same name –
if this is not the case, use the legacy_stratum_var
,
legacy_caty_var
or legacy_aux_var
arguments.
If your population is finite, you may alternatively accommodate legacy
sites by including a variable in your sampling frame that indicates
whether each site is a legacy site or not and then use the
legacy_var
argument in grts()
. When using
legacy_var
, you do not need to use
legacy_stratum_var
, legacy_caty_var
or
legacy_aux_var
.
Another legacy site example is provided in Infinite Population Sampling.
Though the GRTS algorithm selects spatially balanced samples, the algorithm can select sites that are closer together than you may desire. To enforce a minimum distance between sites, run
mindis <- grts(NE_Lakes, n_base = 50, mindis = 1600)
Here we have specified that sites be no closer together than 1600
meters (meters are the units of the sf
object). In some
situations, the grts()
function will fail to enforce the
minimum distance requirement for all sites. When this occurs, the
function will enforce the requirement for as many sites as possible and
then return a warning.
When stratifying, mindis
will apply to all strata if it
is a single value. It is possible to set stratum-specific minimum
distance requirements by storing them in a named list (the names must
match the strata):
Sometimes data cannot be collected at a site. This may occur for a
variety of reasons, some of which include landowner denial or
inaccessible sites. The grts()
function has two options for
selecting replacement sites, which are sites available to replace sites
in the base (main) sample for which data cannot be collected.
The first replacement site option is reverse hierarchical ordering (Stevens and Olsen, 2004). To select a base sample with reverse hierarchically ordered replacement sites, run
rho_replace <- grts(NE_Lakes, n_base = 50, n_over = 25)
rho_replace
#> Summary of Site Counts:
#>
#> total siteuse
#> total:75 Base:50
#> Over:25
n_base
indicates the desired sample size, and
n_over
indicates the number of replacement sites. Sites are
first selected using the GRTS algorithm for a sample size of
n_base + n_over
. They are then determined as base sites or
replacement sites in a way that preserves as much spatial balance as
possible. The spatial balance of the base sites degrades as
n_over
increases, however, so it is important to choose a
realistic value for n_over
.
To visualize the base sites and reverse hierarchically ordered replacement sites, run
When stratifying, n_over
will apply to all strata if it
is a single value. It is possible to set stratum-specific reverse
hierarchical ordering requirements by storing them in a named list (the
names must match the strata):
The second replacement site option is nearest neighbor. To select a base sample with nearest neighbor replacement sites, run
nn_replace <- grts(NE_Lakes, n_base = 50, n_near = 1)
nn_replace
#> Summary of Site Counts:
#>
#> total siteuse
#> total:100 Base:50
#> Near:50
n_base
indicates the desired sample size, and
n_near
indicates the number of replacement sites for each
base site. For n_near = 1
, each site in the base sample has
a replacement site associated with it – this replacement site is the
closest site (measured by Euclidean distance) to the base site (within a
stratum).
When stratifying, n_near
will apply to all strata if it
is a single value. It is possible to set stratum-specific nearest
neighbor requirements by storing them in a named list (the names must
match the strata):
If you want to select samples that are not spatially balanced, use
spsurvey’s irs()
function (IRS is for Independent Random
Sampling). The function arguments for the irs()
function
are the same as the function arguments for the grts()
function. This means that the flexible sampling design options available
for the GRTS algorithm are also available for the IRS algorithm. To
select an unstratified IRS sample where each site in the sampling frame
has an equal inclusion probability, run
eqprob_irs <- irs(NE_Lakes, n_base = 50)
eqprob_irs
#> Summary of Site Counts:
#>
#> total siteuse
#> total:50 Base:50
To visualize the design sites, run
Notice how these IRS design sites appear less spread out than the design sites from the unstratified GRTS sample with equal inclusion probabilities.
To select an IRS sample stratified by elevation category with equal inclusion probabilities in each stratum, run
So far we have talked generally about spatial balance but not about how to measure it. Stevens and Olsen (2004) proposed measuring spatial balance using Voronoi polygons. A design site’s Voronoi polygon contains all sites in the sampling frame closer to this design site than any other design site. To measure spatial balance, these Voronoi polygons are combined with the inclusion probabilities of all sites in the sampling frame and a loss metric.
Several spatial balance metrics are available in spsurvey. The default is Pielou’s evenness index. Pielou’s evenness index ranges from 0 (perfect spatial balance) to 1 (a complete lack of spatial balance). Larger values of the Pielou’s evenness index indicate worse spatial balance. To measure the spatial balance of the unstratified GRTS and IRS samples with equal inclusion probabilities, run
sp_balance(eqprob$sites_base, NE_Lakes) # grts
#> stratum metric value
#> 1 None pielou 0.02585831
sp_balance(eqprob_irs$sites_base, NE_Lakes) # irs
#> stratum metric value
#> 1 None pielou 0.03823327
The GRTS design sites are more spatially balanced than the IRS design sites.
In a stratified sample, spatial balance is calculated separately for each stratum:
sp_balance(strat_eqprob$sites_base, NE_Lakes, stratum_var = "ELEV_CAT") # grts
#> stratum metric value
#> 1 low pielou 0.03680620
#> 2 high pielou 0.04543609
sp_balance(strat_eqprob_irs$sites_base, NE_Lakes, stratum_var = "ELEV_CAT") # irs
#> stratum metric value
#> 1 low pielou 0.0415444
#> 2 high pielou 0.1171086
The GRTS design sites are more spatially balanced than the IRS design sites in both strata.
Though this vignette used a finite population (NE_Lakes
)
to illustrate the sampling options available in spsurvey, the same
commands are used to sample from infinite populations. The
Illinois_River
sf
object in spsurvey is a
linear resource and the Lake_Ontario
sf
object
in spsurvey is an areal resource – both are examples of infinite
populations. To select a GRTS sample with equal inclusion probabilities
from Illinois_River
, run
eqprob <- grts(Illinois_River, n_base = 50)
To visualize the design sites overlain onto the sampling frame, run
To accommodate the Illinois River legacy sites, run:
legacy <- grts(Illinois_River, n_base = 50, legacy_sites = Illinois_River_Legacy)
To select a GRTS sample with equal inclusion probabilities from
Lake_Ontario
, run
eqprob <- grts(Lake_Ontario, n_base = 50)
To visualize the design sites (with closed circles) overlain onto the sampling frame, run
To bind sites_legacy
, sites_base
,
sites_over
, sites_near
(four separate
sf
objects) into a single sf
object, use
sp_rbind()
. For example, to combine the base and reverse
hierarchically ordered replacement sites from rho_replace
into a single sf
object, run
combined <- sp_rbind(rho_replace)
Then it is straightforward to write out a single sf
object using a function like sf::write_sf()
. For example,
to save the combined sf
object as a shapefile named
"file_name.shp"
at the location on your machine called
"file_path"
, run
write_sf(combined, "file_path/file_name.shp")
Olsen, A. R., T. M. Kincaid and Q. Payton (2012). Spatially balanced survey designs for natural resources. Design and Analysis of Long-Term Ecological Monitoring Studies. R. A. Gitzen, J. J. Millspaugh, A. B. Cooper and D. S. Licht. Cambridge, UK, Cambridge University Press: 126-150.
Stevens Jr, D. L. and Olsen, A. R. (2004). Spatially balanced sampling of natural resources. Journal of the American Statistical Association, 99(465):262-278.