R/cont_cdftest.R
cont_cdftest.Rd
This function organizes input and output for conducting inference regarding cumulative distribution functions (CDFs) generated by a probability survey. For every response variable and every subpopulation (domain) variable, differences between CDFs are tested for every pair of subpopulations within the domain.
Data input to the function can come from a single survey or from two or more surveys. If the data contain multiple surveys, then the domain variables will reference those surveys and (potentially) subpopulations within those surveys.
The inferential procedures divide the CDFs into a discrete set of intervals (classes) and then apply procedures developed for the analysis of categorical data from probability surveys. Choices for inference are the Wald, adjusted Wald, Rao-Scott first-order corrected (mean eigenvalue corrected), and Rao-Scott second-order corrected (Satterthwaite corrected) test statistics. The default test statistic is the adjusted Wald statistic.
The input data argument can be either a data frame or a simple features (sf) object. If an sf object is used, coordinates are extracted from the geometry column in the object, arguments xcoord and ycoord are assigned values "xcoord" and "ycoord", respectively, and the geometry column is dropped from the object.
cont_cdftest(
  dframe,
  vars,
  subpops = NULL,
  surveyID = NULL,
  siteID = "siteID",
  weight = "weight",
  xcoord = NULL,
  ycoord = NULL,
  stratumID = NULL,
  clusterID = NULL,
  weight1 = NULL,
  xcoord1 = NULL,
  ycoord1 = NULL,
  sizeweight = FALSE,
  sweight = NULL,
  sweight1 = NULL,
  fpc = NULL,
  popsize = NULL,
  vartype = "Local",
  jointprob = "overton",
  testname = "adjWald",
  nclass = 3
)
dframe
Data frame containing survey design variables, response variables, and subpopulation (domain) variables.
vars
Vector composed of character values that identify the names of response variables in the dframe data frame.
subpops
Vector composed of character values that identify the names of subpopulation (domain) variables in the dframe data frame. If a value is not provided, the value "All_Sites" is assigned to the subpops argument and a factor variable named "All_Sites" that takes the value "All Sites" is added to the dframe data frame. The default value is NULL.
surveyID
Character value providing the name of the survey ID variable in the dframe data frame. If this argument equals NULL, then the dframe data frame is assumed to contain data for a single survey. The default value is NULL.
siteID
Character value providing the name of the site ID variable in the dframe data frame. For a two-stage sample, the site ID variable identifies stage two site IDs. The default value is "siteID".
weight
Character value providing the name of the survey design weight variable in the dframe data frame. For a two-stage sample, the weight variable identifies stage two weights. The default value is "weight".
xcoord
Character value providing the name of the x-coordinate variable in the dframe data frame. For a two-stage sample, the x-coordinate variable identifies stage two x-coordinates. Note that x-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
ycoord
Character value providing the name of the y-coordinate variable in the dframe data frame. For a two-stage sample, the y-coordinate variable identifies stage two y-coordinates. Note that y-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
stratumID
Character value providing the name of the stratum ID variable in the dframe data frame. The default value is NULL.
clusterID
Character value providing the name of the cluster (stage one) ID variable in the dframe data frame. Note that cluster IDs are required for a two-stage sample. The default value is NULL.
weight1
Character value providing the name of the stage one weight variable in the dframe data frame. The default value is NULL.
xcoord1
Character value providing the name of the stage one x-coordinate variable in the dframe data frame. Note that x-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
ycoord1
Character value providing the name of the stage one y-coordinate variable in the dframe data frame. Note that y-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
sizeweight
Logical value that indicates whether size weights should be used during estimation, where TRUE uses size weights and FALSE does not use size weights. To employ size weights for a single-stage sample, a value must be supplied for argument weight. To employ size weights for a two-stage sample, values must be supplied for arguments weight and weight1. The default value is FALSE.
sweight
Character value providing the name of the size weight variable in the dframe data frame. For a two-stage sample, the size weight variable identifies stage two size weights. The default value is NULL.
sweight1
Character value providing the name of the stage one size weight variable in the dframe data frame. The default value is NULL.
fpc
Object that specifies values required for calculation of the finite population correction factor used during variance estimation. The object must match the survey design in terms of stratification and whether the design is single-stage or two-stage.
For an unstratified design, the object is a vector. For a single-stage design, the vector contains a single numeric value. For a two-stage design, the object is a named vector with one more element than the number of clusters in the sample: the first element specifies the number of clusters in the population, and each subsequent element specifies the number of stage two units for the corresponding cluster. The name of the first element is arbitrary. Subsequent names identify clusters and must match the cluster IDs.
For a stratified design, the object is a named list of vectors, where names must match the strata IDs. For each stratum, the format of the vector is identical to the format described for unstratified single-stage and two-stage designs.
Note that the finite population correction factor is not used with the local mean variance estimator.
Example fpc for a single-stage unstratified survey design:
fpc <- 15000
Example fpc for a single-stage stratified survey design:
fpc <- list(
  Stratum_1 = 9000,
  Stratum_2 = 6000)
Example fpc for a two-stage unstratified survey design:
fpc <- c(
  Ncluster = 150,
  Cluster_1 = 150,
  Cluster_2 = 75,
  Cluster_3 = 75,
  Cluster_4 = 125,
  Cluster_5 = 75)
Example fpc for a two-stage stratified survey design:
fpc <- list(
  Stratum_1 = c(
    Ncluster_1 = 100,
    Cluster_1 = 125,
    Cluster_2 = 100,
    Cluster_3 = 100,
    Cluster_4 = 125,
    Cluster_5 = 50),
  Stratum_2 = c(
    Ncluster_2 = 50,
    Cluster_1 = 75,
    Cluster_2 = 150,
    Cluster_3 = 75,
    Cluster_4 = 75,
    Cluster_5 = 125))
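The two-stage layout described above is easy to get wrong, so a quick check before calling the function may help. The sketch below uses a hypothetical helper, check_fpc_two_stage (not part of any package), and a made-up vector of sampled cluster IDs, sample_clusters. It only verifies the length and names of an unstratified two-stage fpc vector, under the assumption that cluster IDs are compared as character strings.

```r
# Hypothetical helper (not part of spsurvey): checks that a two-stage,
# unstratified fpc vector follows the layout described above.
check_fpc_two_stage <- function(fpc, cluster_ids) {
  clusters <- unique(as.character(cluster_ids))
  # One element for the population cluster count plus one per sampled cluster
  if (length(fpc) != length(clusters) + 1) {
    return(FALSE)
  }
  # Names after the first element must match the cluster IDs in the sample
  fpc_clusters <- names(fpc)[-1]
  is.numeric(fpc) && !is.null(fpc_clusters) && setequal(fpc_clusters, clusters)
}

fpc <- c(
  Ncluster = 150, Cluster_1 = 150, Cluster_2 = 75,
  Cluster_3 = 75, Cluster_4 = 125, Cluster_5 = 75
)
# Made-up cluster ID column: five sampled clusters, four sites in each
sample_clusters <- rep(paste0("Cluster_", 1:5), each = 4)

ok <- check_fpc_two_stage(fpc, sample_clusters)       # well-formed vector
bad <- check_fpc_two_stage(fpc[-6], sample_clusters)  # Cluster_5 entry missing
```

For a stratified design, the same check could be applied to each list element together with the cluster IDs of the matching stratum.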
popsize
Object that provides values for the population argument of the calibrate or postStratify functions in the survey package. If a value is provided for popsize, then either the calibrate or postStratify function is used to modify the survey design object that is required by functions in the survey package. Whether to use the calibrate or postStratify function is dictated by the format of popsize, which is discussed below.
Post-stratification adjusts the sampling and replicate weights so that the joint distribution of a set of post-stratifying variables matches the known population joint distribution. Calibration, generalized raking, or GREG estimators generalize post-stratification and raking by calibrating a sample to the marginal totals of variables in a linear regression model.
For the calibrate function, the object is a named list, where the names identify factor variables in the dframe data frame. Each element of the list is a named vector containing the population total for each level of the associated factor variable.
For the postStratify function, the object is a data frame, table, or xtabs object that provides the population total for all combinations of selected factor variables in the dframe data frame. If a data frame is used for popsize, the variable containing population totals must be the last variable in the data frame. If a table is used for popsize, the table must have named dimnames where the names identify factor variables in the dframe data frame.
If the popsize argument is equal to NULL, then neither calibration nor post-stratification is performed. The default value is NULL.
Example popsize for calibration:
popsize <- list(
  Ecoregion = c(
    East = 750,
    Central = 500,
    West = 250),
  Type = c(
    Streams = 1150,
    Rivers = 350))
Example popsize for post-stratification using a data frame:
popsize <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"),
    rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75))
Example popsize for post-stratification using a table:
popsize <- with(MySurveyFrame,
  table(Ecoregion, Type))
Example popsize for post-stratification using an xtabs object:
popsize <- xtabs(~Ecoregion + Type,
  data = MySurveyFrame)
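The calibration and post-stratification examples above describe the same population in two layouts. A minimal sketch, using only base R, shows how the marginal totals in the calibration list can be recovered from the data frame of cell totals. The object names popsize_df, popsize_tab, and popsize_margins are illustrative and do not come from the package.

```r
# Cell totals for every Ecoregion x Type combination (same values as the
# data frame example above)
popsize_df <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"), rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75)
)

# Cross-tabulate the totals: one cell per Ecoregion x Type combination
popsize_tab <- xtabs(Total ~ Ecoregion + Type, data = popsize_df)

# Marginal totals as named vectors, in the named-list layout shown for
# calibration (levels appear in alphabetical order)
popsize_margins <- list(
  Ecoregion = rowSums(popsize_tab),
  Type = colSums(popsize_tab)
)
```

The margins computed this way agree with the totals in the calibration example (for instance, 750 for East and 1150 for Streams), which is a useful consistency check when both layouts are maintained by hand.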
vartype
Character value providing the choice of the variance estimator, where "Local" indicates the local mean estimator, "SRS" indicates the simple random sampling estimator, "HT" indicates the Horvitz-Thompson estimator, and "YG" indicates the Yates-Grundy estimator. The default value is "Local".
jointprob
Character value providing the choice of joint inclusion probability approximation for use with Horvitz-Thompson and Yates-Grundy variance estimators, where "overton" indicates the Overton approximation, "hr" indicates the Hartley-Rao approximation, and "brewer" indicates the Brewer approximation. The default value is "overton".
testname
Name of the test statistic to be reported in the output data frame. Choices for the name are: "Wald", "adjWald", "RaoScott_First", and "RaoScott_Second", which correspond to the Wald statistic, adjusted Wald statistic, Rao-Scott first-order corrected statistic, and Rao-Scott second-order corrected statistic, respectively. The default is "adjWald".
nclass
Number of classes into which the CDFs will be divided (binned), which must be at least 2. The default is 3.
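To illustrate what dividing a CDF into classes means, the following sketch bins a simulated response into nclass intervals and tabulates weighted class proportions for two subpopulations. It is not the code used by the function: the quantile-based class boundaries, the simulated data, and all variable names are assumptions made for illustration, and no test statistic is computed.

```r
set.seed(1)
n <- 200
# Simulated subpopulation labels, response values, and design weights
group <- sample(c("Agr", "Urban"), n, replace = TRUE)
y <- rnorm(n, mean = ifelse(group == "Urban", 10.5, 10), sd = 1)
wgt <- runif(n, 10, 100)

nclass <- 3
# ASSUMPTION: class boundaries are placed at equally spaced quantiles of the
# pooled response; the package may choose boundaries differently
breaks <- quantile(y, probs = seq(0, 1, length.out = nclass + 1))
y_class <- cut(y,
  breaks = breaks, include.lowest = TRUE,
  labels = paste0("Class_", seq_len(nclass))
)

# Weighted total in each class for each subpopulation
wtab <- xtabs(wgt ~ group + y_class)
# Estimated proportion of each subpopulation falling in each class
prop <- prop.table(wtab, margin = 1)
```

Once the response is reduced to class proportions like these, comparing two CDFs becomes a comparison of two categorical distributions, which is where the Wald and Rao-Scott statistics apply.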
Data frame of CDF test results for all pairs of subpopulations within each population type for every response variable. The data frame includes the test statistic specified by argument testname plus its degrees of freedom and p-value.
cdf_plot
for visualizing CDFs
cont_cdfplot
for creating CDF plots that are output to PDF files
library(spsurvey)
# Simulated single-stage survey; values are random, so results differ between runs
n <- 200
mysiteID <- paste0("Site", seq_len(n))
dframe <- data.frame(
  siteID = mysiteID,
  wgt = runif(n, 10, 100),
  xcoord = runif(n),
  ycoord = runif(n),
  stratum = rep(c("Stratum1", "Stratum2"), n / 2),
  Resource_Class = sample(c("Agr", "Forest", "Urban"), n, replace = TRUE)
)
# Response variable whose mean shifts slightly with resource class
ContVar <- numeric(n)
tst <- dframe$Resource_Class == "Agr"
ContVar[tst] <- rnorm(sum(tst), 10, 1)
tst <- dframe$Resource_Class == "Forest"
ContVar[tst] <- rnorm(sum(tst), 10.1, 1)
tst <- dframe$Resource_Class == "Urban"
ContVar[tst] <- rnorm(sum(tst), 10.5, 1)
dframe$ContVar <- ContVar
myvars <- c("ContVar")
mysubpops <- c("Resource_Class")
# Population totals for each Resource_Class x stratum combination
# (data frame format, so post-stratification is used)
mypopsize <- data.frame(
  Resource_Class = rep(c("Agr", "Forest", "Urban"), rep(2, 3)),
  stratum = rep(c("Stratum1", "Stratum2"), 3),
  Total = c(2500, 1500, 1000, 500, 600, 450)
)
# Test for CDF differences between every pair of resource classes
cont_cdftest(dframe,
  vars = myvars, subpops = mysubpops, siteID = "siteID",
  weight = "wgt", xcoord = "xcoord", ycoord = "ycoord",
  stratumID = "stratum", popsize = mypopsize, testname = "RaoScott_First"
)
#>             Type Subpopulation_1 Subpopulation_2 Indicator
#> 1 Resource_Class             Agr          Forest   ContVar
#> 2 Resource_Class             Agr           Urban   ContVar
#> 3 Resource_Class          Forest           Urban   ContVar
#>   Rao-Scott First Order Statistic Degrees_of_Freedom     p_Value
#> 1                        0.512239                  2 0.822142193
#> 2                        8.691382                  2 0.009796616
#> 3                       14.275884                  2 0.005918207
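Because every pair of subpopulations is tested, the output often contains several p-values per response variable. As one possible follow-up (not something the output above includes), the sketch below applies a Holm adjustment with p.adjust from base R and keeps the pairs that remain below 0.05. The results data frame is typed in by hand from the printed output; the column name p_Holm and the 0.05 threshold are illustrative choices.

```r
# Values copied from the example output above; in practice this would be the
# data frame returned by cont_cdftest()
results <- data.frame(
  Subpopulation_1 = c("Agr", "Agr", "Forest"),
  Subpopulation_2 = c("Forest", "Urban", "Urban"),
  p_Value = c(0.822142193, 0.009796616, 0.005918207)
)

# Holm adjustment across the three pairwise comparisons (optional step)
results$p_Holm <- p.adjust(results$p_Value, method = "holm")

# Pairs whose CDFs still differ at the 0.05 level after adjustment
significant <- subset(results, p_Holm < 0.05)
```

Whether an adjustment is appropriate, and which method to use, depends on the goals of the analysis; Holm is shown only because it is available in base R and makes no independence assumption.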