Select an independent random sample (IRS)

Select a sample that is not spatially balanced from a point (finite), linear / linestring (infinite), or areal / polygon (infinite) sampling frame using the Independent Random Sampling (IRS) algorithm. The IRS algorithm accommodates unstratified and stratified sampling designs and allows for equal inclusion probabilities, unequal inclusion probabilities according to a categorical variable, and inclusion probabilities proportional to a positive auxiliary variable. Several additional sampling options are included, such as including legacy (historical) sites, requiring a minimum distance between sites, and selecting replacement sites.

irs(
  sframe,
  n_base,
  stratum_var = NULL,
  seltype = NULL,
  caty_var = NULL,
  caty_n = NULL,
  aux_var = NULL,
  legacy_var = NULL,
  legacy_sites = NULL,
  legacy_stratum_var = NULL,
  legacy_caty_var = NULL,
  legacy_aux_var = NULL,
  mindis = NULL,
  maxtry = 10,
  n_over = NULL,
  n_near = NULL,
  wgt_units = NULL,
  pt_density = NULL,
  DesignID = "Site",
  SiteBegin = 1,
  sep = "-",
  projcrs_check = TRUE
)

Arguments

sframe: A sampling frame as an sf object. The coordinate system for sframe must projected (not geographic). If m or z values are in sframe's geometry, they are silently dropped (i.e., only x-coordinates and y-coordinates are preserved).
n_base: The base sample size required. If the sampling design is unstratified, this is a single numeric value. If the sampling design is stratified, this is a named vector or list whose names represent each stratum and whose values represent each stratum's sample size. These names must match the values of the stratification variable represented by stratum_var. Legacy sites are considered part of the base sample, so the value for n_base should be equal to the number of legacy sites plus the number of desired non-legacy sites.
stratum_var: A character string containing the name of the column from sframe that identifies stratum membership for each element in sframe. If stratum equals NULL, the sampling design is unstratified and all elements in sframe are eligible to be selected in the sample. The default is NULL.
seltype: A character string or vector indicating the inclusion probability type, which must be one of following: "equal" for equal inclusion probabilities; "unequal" for unequal inclusion probabilities according to a categorical variable specified by caty_var; and "proportional" for inclusion probabilities proportional to a positive auxiliary variable specified by aux_var. If the sampling design is unstratified, seltype is a single character vector. If the sampling design is stratified, seltype is a named vector whose names represent each stratum and whose values represent each stratum's inclusion probability type. seltype's default value tries to match the intended inclusion probability type: If caty_var and aux_var are not specified, seltype is "equal"; if caty_var is specified, seltype is "unequal"; and if aux_var is specified, seltype is "proportional".
caty_var: A character string containing the name of the column from sframe that represents the unequal probability variable.
caty_n: A character vector indicating the expected sample size for each level of caty_var, the unequal probability variable. If the sampling design is unstratified, caty_n is a named vector whose names represent each level of caty_var and whose values represent each level's expected sample size. The sum of caty_n must equal n_base. If the sampling design is stratified and the expected sample sizes are the same among strata, caty_n is a named vector whose names represent represent each level of caty_var and whose values represent each level's expected sample size -- these expected sample sizes are applied to all strata. The sum of caty_n must equal each stratum's value in n_base. If the sampling design is stratified and the expected sample sizes differ among strata, caty_n is a list where each element is named as a stratum in n_base. Each stratum's list element is a named vector whose names represent each level of caty_var and whose values represent each level's expected sample size (within the stratum). The sum of the values in each stratum's list element must equal that stratum's value in n_base.
aux_var: A character string containing the name of the column from sframe that represents the proportional (to size) inclusion probability variable (auxiliary variable). This auxiliary variable must be positive, and the resulting inclusion probabilities are proportional to the values of the auxiliary variable. Larger values of the auxiliary variable result in higher inclusion probabilities.
legacy_var: This argument can be used instead of legacy_sites when sframe is a POINT or MULTIPOINT geometry (i.e. a finite sampling frame), When legacy_var is used, it is a character string containing the name of the column from sframe that represents whether each site is a legacy site. For legacy sites, the values of the legacy_var must contain character strings that act as a legacy site identifier. For non-legacy sites, the values of the legacy_var column must be NA. Using this approach, legacy_stratum_var, legacy_caty_var, and legacy_aux_var are not required and should not be used (because legacy_var represents a column in sframe). spsurvey assumes that the legacy sites were selected from a previous sampling design that incorporated randomness into site selection and that the legacy sites are elements of the current sampling frame.
legacy_sites: An sf object with a POINT or MULTIPOINT geometry representing the legacy sites. spsurvey assumes that the legacy sites were selected from a previous sampling design that incorporated randomness into site selection and that the legacy sites are elements of the current sampling frame. If sframe has a POINT or MULTIPOINT geometry, the observations in legacy_sites should not also be in sframe (i.e., duplicates are not removed). Thus, sframe and legacy_sites together compose the current sampling frame. If m or z values are in legacy_sites' geometry, they are silently dropped (i.e., only x-coordinates and y-coordinates are preserved).
legacy_stratum_var: A character string containing the name of the column from legacy_sites that identifies stratum membership for each element of legacy_sites. This argument is required when the sampling design is stratified and its levels must be contained in the levels of the stratum_var variable. The default value of legacy_stratum_var is stratum_var, so legacy_stratum_var need only be specified explicitly when the name of the stratification variable in legacy_sites differs from stratum_var.
legacy_caty_var: A character string containing the name of the column from legacy_sites that identifies the unequal probability variable for each element of legacy_sites. This argument is required when the sampling design uses unequal selection probabilities and its categories must be contained in the levels of the caty_var variable. The default value of legacy_caty_var is caty_var, so legacy_caty_var need only be specified explicitly when the name of the unequal probability variable in legacy_sites differs from caty_var.
legacy_aux_var: A character string containing the name of the column from legacy_sites that identifies the proportional probability variable for each element of legacy_sites. This argument is required when the sampling design uses proportional selection probabilities and the values of the legacy_aux_var variable must be positive. The default value of legacy_aux_var is aux_var, so legacy_aux_var need only be specified explicitly when the name of the proportional probability variable in legacy_sites differs from aux_var.
mindis: A numeric value indicating the desired minimum distance between sampled sites. If the sampling design is stratified and mindis is an numeric value, the minimum distance is applied to all strata. If the sampling design is stratified and different minimum distances are desired among strata, then mindis is a list whose names match the names of n_base and whose and values are the minimum distance for the corresponding stratum. If a minimum distance is not desired for a particular stratum, then the corresponding value in mindis should be 0 or NULL (which is equivalent to 0). The units of mindis must represent the units in sframe. A warning is returned if the minimum distance could not be reached after maxtry attempts. If legacy sites are used, the minimum distance requirement (and subsequent warning if maxtry attempts are reached) is enforced for all base sites that are not legacy sites (i.e., the minimum distance is enforced for these sites by comparing distances against all base sites (legacy and non-legacy)).
maxtry: The number of maximum attempts to apply the minimum distance algorithm to obtain the desired minimum distance between sites. Each iteration takes roughly as long as the standard GRTS algorithm. Successive iterations will always contain at least as many sites satisfying the minimum distance requirement as the previous iteration. The algorithm stops when the minimum distance requirement is met or there are maxtry iterations. The default number of maximum iterations is 10.
n_over: The number of reverse hierarchically ordered (rho) replacement sites. If the sampling design is unstratified, then n_over is an integer specifying the number of rho replacement sites desired. If the sampling design is stratified, then n_over is a vector (or list) whose names match the names of n_base and whose values indicate the number of rho replacement sites for each stratum. If replacement sites are not desired for a particular stratum, then the corresponding value in n_over should be 0 or NULL (which is equivalent to 0). If the sampling design is stratified but the number of n_over sites is the same in each stratum, n_over can be a vector which is used for each stratum. If n_over is an unnamed, length-one vector, it's value is recycled and used for each stratum. Note that if the sampling design has unequal selection probabilities (seltype = "unequal"), then n_over sites are given the same proportion of caty_n values as n_base.
n_near: The number of nearest neighbor (nn) replacement sites. If the sampling design is unstratified, n_near is integer from 1 to 10 specifying the number of nn replacement sites to be selected for each base site. If the sampling design is stratified but the same number of nn replacement sites is desired for each stratum, n_near is integer from 1 to 10 specifying the number of nn replacement sites to be selected for each base site. If the sampling design is unstratified and a different number of nn replacement sites is desired for each stratum, n_near is a vector (or list) whose names represent strata and whose values is integer from 1 to 10 specifying the number of nn replacement sites to be selected for each base site in the stratum. If replacement sites are not desired for a particular stratum, then the corresponding value in n_over should be 0 or NULL (which is equivalent to 0). For infinite sampling frames, the distance between a site and its nn depends on pt_density. The larger pt_density, the closer the nn neighbors.
wgt_units: The units used to compute the design weights. These units must be standard units as defined by the set_units() function in the units package. The default units match the units of the sf object.
pt_density: A positive integer controlling the density of the GRTS approximation for infinite sampling frames. The GRTS approximation for infinite sample frames vastly improves computational efficiency by generating many finite points and selecting a sample from the points. pt_density represents the density of finite points per unit to use in the approximation. More specifically, for each stratum, the number of points used in the approximation equals pt_density * (n_base + n_over). A larger value of pt_density means a closer approximation to the infinite sampling frame but less computational efficiency. The default value of pt_density is 10. Note that when used with caty_n, the unequal inclusion probabilities generated from this approach are also approximations.
DesignID: A character string indicating the naming structure for each site's identifier selected in the sample, which is matched with SiteBegin and included as a variable in the sf object in the function's output. Default is "Site".
SiteBegin: A character string indicating the first number to use to match with DesignID while creating each site's identifier selected in the sample. Successive sites are given successive integers. The default starting number is 1 and the number of digits is equal to number of digits in nbase + nover. For example, if nbase is 50 and nover is 0, then the default site identifiers are Site-01 to Site-50
sep: A character string that acts as a separator between DesignID and SiteBegin. The default is "-".
projcrs_check: A check for whether the coordinates are projected. If TRUE, an error is returned if coordinates are not projected (i.e., they are geographic or NA). If FALSE, the check is not performed, which means that the crs in sframe (and legacy_sites if provided) can be projected, geographic, or NA.

Value

The sampling design sites and additional information about the sampling design. More specifically, it is, a list with five elements:

sites_legacy An sf object containing legacy sites. This is NULL if legacy sites were not included in the sample.
sites_base An sf object containing the base sites. This is NULL if n_base equals the number of legacy sites.
sites_over An sf object containing the reverse hierarchically ordered replacement sites. This is NULL if no reverse hierarchically ordered replacement sites were included in the sample.
sites_near An sf object containing the nearest neighbor replacement sites. This is NULL if no nearest neighbor replacement sites were included in the sample.
design A list documenting the specifications of this sampling design. This can be checked to verify your sampling design ran as intended.
- call The original function call.
- stratum_var The name of the stratification variable in sframe. This equals NULL if no stratification is used.
- stratum The unique strata. This equals "None" if the sampling design is unstratified.
- n_base The base sample size per stratum.
- seltype The selection type per stratum.
- caty_var The name of the unequal probability variable in sframe. This equals NULL if no unequal probability variable is used.
- caty_n The expected sample sizes for each level of the unequal probability grouping variable per stratum. This equals NULL when seltype is not "unequal".
- aux_var The name of the proportional probability (auxiliary) variable in sframe. This equals NULL if no proportional probability variable is used.
- legacy A logical variable indicating whether legacy sites were included in the sample.
- legacy_stratum_var The name of the stratification variable in legacy_sites. Omitted if legacy sites are not used. This equals NULL if legacy sites were used but no stratification variable is used.
- legacy_caty_var The name of the unequal probability variable in legacy_sites. Omitted if legacy sites are not used. This equals NULL if legacy sites were used but no unequal probability variable is used.
- legacy_aux_var The name of the proportional probability (auxiliary) variable in legacy_sites. Omitted if legacy sites are not used. This equals NULL if legacy sites were used but no proportional probability variable is used.
- mindis The minimum distance requirement desired. This is NULL when no minimum distance requirement was applied.
- n_over The reverse hierarchically ordered replacement site sample sizes per stratum. If seltype is unequal, this represents the expected sample sizes. This is NULL when no reverse hierarchically ordered replacement sites were selected.
- n_near The number of nearest neighbor replacement sites desired. This is NULL when no nearest neighbor replacement sites were selected.

When non-NULL, the sites_legacy, sites_base,

sites_over, and sites_near objects contain the original columns in sframe and include a few additional columns. These additional columns are

siteID A site identifier (as named using the DesignID and SiteBegin arguments to grts()).
siteuse Whether the site is a legacy site (Legacy), base site (Base), reverse hierarchically ordered replacement site (Over), or nearest neighbor replacement site (Near).
replsite The replacement site ordering. replsite is None if the site is not a replacement site, Next if it is the next reverse hierarchically ordered replacement site to use, or Near_, where the word following _ indicates the ordering of sites closest to the originally sampled site.
lon_WGS84 Longitude coordinates using the WGS84 coordinate system (EPSG:4326). Only given if coordinates are projected.
lat_WGS84 Latitude coordinates using the WGS84 coordinate system (EPSG:4326). Only given if coordinates are projected.
X Longitude coordinates using the provided coordinate system. Only given if coordinates are not projected (i.e., they are geographic or NA).
Y Latitude coordinates using the provided coordinate system. Only given if coordinates are not projected (i.e., they are geographic or NA).
stratum A stratum indicator. stratum is None if the sampling design was unstratified. If the sampling design was stratified, stratum indicates the stratum.
wgt The design weight.
ip The site's original inclusion probability (the reciprocal) of (wgt).
caty An unequal probability grouping indicator. caty is None if the sampling design did not use unequal inclusion probabilities. If the sampling design did use unequal inclusion probabilities, caty indicates the unequal probability level.
aux The auxiliary proportional probability variable. This column is only returned if seltype was proportional in the original sampling design.

If any columns in sframe contain these names, those columns from sframe will be automatically prefixed with sframe_

in the sites object. When output is printed, a summary of site counts by the levels in stratum_var and caty_var is shown.

Details

n_base is the number of sites used to calculate the design weights, which is typically the number of sites used in an analysis. When a panel sampling design is implemented, n_base is typically the number of sites in all panels that will be sampled in the same temporal period -- n_base is not the total number of sites in all panels. The sum of n_base and n_over is equal to the total number of sites to be visited for all panels plus any replacement sites that may be required.

Author

Tony Olsen olsen.tony@epa.gov

Examples