This function organizes input and output for the analysis of categorical variables. The analysis data, dframe, can be either a data frame or a simple features (sf) object. If an sf object is used, coordinates are extracted from the geometry column in the object, arguments xcoord and ycoord are assigned values "xcoord" and "ycoord", respectively, and the geometry column is dropped from the object.
cat_analysis(
  dframe,
  vars,
  subpops = NULL,
  siteID = NULL,
  weight = "weight",
  xcoord = NULL,
  ycoord = NULL,
  stratumID = NULL,
  clusterID = NULL,
  weight1 = NULL,
  xcoord1 = NULL,
  ycoord1 = NULL,
  sizeweight = FALSE,
  sweight = NULL,
  sweight1 = NULL,
  fpc = NULL,
  popsize = NULL,
  vartype = "Local",
  jointprob = "overton",
  conf = 95,
  All_Sites = FALSE
)
dframe: Data to be analyzed (analysis data). A data frame or sf object containing survey design variables, response variables, and subpopulation (domain) variables.
vars: Vector composed of character values that identify the names of response variables in dframe.
subpops: Vector composed of character values that identify the names of subpopulation (domain) variables in dframe. If a value is not provided, the value "All_Sites" is assigned to the subpops argument and a factor variable named "All_Sites" that takes the value "All Sites" is added to the dframe data frame. The default value is NULL.
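The default behavior described for subpops can be pictured with a small base R sketch. It only mimics what the text above describes and is not the function's internal code; the toy dframe is made up for illustration.

```r
# Toy analysis data (hypothetical values, for illustration only)
dframe <- data.frame(
  siteID = paste0("Site", 1:4),
  CatVar = c("north", "south", "east", "west")
)
subpops <- NULL

# Mimic the documented default: use "All_Sites" as the subpopulation
# variable and add a factor column that takes the single value "All Sites"
if (is.null(subpops)) {
  subpops <- "All_Sites"
  dframe$All_Sites <- factor(rep("All Sites", nrow(dframe)))
}

levels(dframe$All_Sites)
```

With this default, every site falls into one subpopulation, so the estimates describe the whole sample.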
siteID: Character value providing the name of the site ID variable in the dframe data frame. For a two-stage sample, the site ID variable identifies stage two site IDs. The default value is NULL, which assumes that each row in dframe represents a unique site.
weight: Character value providing the name of the design weight variable in dframe. For a two-stage sample, the weight variable identifies stage two weights. The default value is "weight".
xcoord: Character value providing the name of the x-coordinate variable in the dframe data frame. For a two-stage sample, the x-coordinate variable identifies stage two x-coordinates. Note that x-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required (as the geometry column in dframe is used to find the x-coordinate). The default value is NULL.
ycoord: Character value providing the name of the y-coordinate variable in the dframe data frame. For a two-stage sample, the y-coordinate variable identifies stage two y-coordinates. Note that y-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required (as the geometry column in dframe is used to find the y-coordinate). The default value is NULL.
stratumID: Character value providing the name of the stratum ID variable in the dframe data frame. The default value is NULL.
clusterID: Character value providing the name of the cluster (stage one) ID variable in dframe. Note that cluster IDs are required for a two-stage sample. The default value is NULL.
weight1: Character value providing the name of the stage one weight variable in dframe. The default value is NULL.
xcoord1: Character value providing the name of the stage one x-coordinate variable in dframe. Note that x-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
ycoord1: Character value providing the name of the stage one y-coordinate variable in dframe. Note that y-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
sizeweight: Logical value that indicates whether size weights should be used during estimation, where TRUE uses size weights and FALSE does not use size weights. To employ size weights for a single-stage sample, a value must be supplied for argument weight. To employ size weights for a two-stage sample, values must be supplied for arguments weight and weight1. The default value is FALSE.
sweight: Character value providing the name of the size weight variable in dframe. For a two-stage sample, the size weight variable identifies stage two size weights. The default value is NULL.
sweight1: Character value providing the name of the stage one size weight variable in dframe. The default value is NULL.
fpc: Object that specifies values required for calculation of the finite population correction factor used during variance estimation. The object must match the survey design in terms of stratification and whether the design is single-stage or two-stage.

For an unstratified design, the object is a vector. For a single-stage design, the vector is composed of a single numeric value. For a two-stage unstratified design, the object is a named vector whose length is one more than the number of clusters in the sample, where the first item in the vector specifies the number of clusters in the population and each subsequent item specifies the number of stage two units for the cluster. The name for the first item in the vector is arbitrary. Subsequent names in the vector identify clusters and must match the cluster IDs.

For a stratified design, the object is a named list of vectors, where names must match the strata IDs. For each stratum, the format of the vector is identical to the format described for unstratified single-stage and two-stage designs.

Note that the finite population correction factor is not used with the local mean variance estimator.
Example fpc for a single-stage unstratified survey design:
fpc <- 15000

Example fpc for a single-stage stratified survey design:
fpc <- list(
  Stratum_1 = 9000,
  Stratum_2 = 6000)

Example fpc for a two-stage unstratified survey design:
fpc <- c(
  Ncluster = 150,
  Cluster_1 = 150,
  Cluster_2 = 75,
  Cluster_3 = 75,
  Cluster_4 = 125,
  Cluster_5 = 75)

Example fpc for a two-stage stratified survey design:
fpc <- list(
  Stratum_1 = c(
    Ncluster_1 = 100,
    Cluster_1 = 125,
    Cluster_2 = 100,
    Cluster_3 = 100,
    Cluster_4 = 125,
    Cluster_5 = 50),
  Stratum_2 = c(
    Ncluster_2 = 50,
    Cluster_1 = 75,
    Cluster_2 = 150,
    Cluster_3 = 75,
    Cluster_4 = 75,
    Cluster_5 = 125))
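
Because a two-stage fpc has to line up with the cluster IDs in the sample, a quick structural check before running the analysis may save some debugging. The following is a minimal base R sketch under stated assumptions: check_fpc_two_stage is a hypothetical helper (not part of any package), and the cluster IDs are made up for illustration.

```r
# Hypothetical helper: checks a two-stage, unstratified fpc vector against
# the cluster IDs that appear in the sample
check_fpc_two_stage <- function(fpc, cluster_ids) {
  clusters <- unique(as.character(cluster_ids))
  stage_two <- fpc[-1]  # items after the population cluster count
  is.numeric(fpc) &&
    length(fpc) == length(clusters) + 1 &&
    !is.null(names(stage_two)) &&
    setequal(names(stage_two), clusters)
}

# Made-up sample with two sites in each of five clusters
cluster_ids <- rep(paste0("Cluster_", 1:5), each = 2)

fpc <- c(
  Ncluster = 150,
  Cluster_1 = 150,
  Cluster_2 = 75,
  Cluster_3 = 75,
  Cluster_4 = 125,
  Cluster_5 = 75)

check_fpc_two_stage(fpc, cluster_ids)

# Dropping one cluster entry makes the check fail
bad_fpc <- fpc[names(fpc) != "Cluster_5"]
check_fpc_two_stage(bad_fpc, cluster_ids)
```

For a stratified design, the same check could be applied to each stratum's vector together with the cluster IDs observed in that stratum.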
popsize: Object that provides values for the population argument of the calibrate or postStratify functions in the survey package. If a value is provided for popsize, then either the calibrate or postStratify function is used to modify the survey design object that is required by functions in the survey package. Whether to use the calibrate or postStratify function is dictated by the format of popsize, which is discussed below.

Post-stratification adjusts the sampling and replicate weights so that the joint distribution of a set of post-stratifying variables matches the known population joint distribution. Calibration, generalized raking, or GREG estimators generalize post-stratification and raking by calibrating a sample to the marginal totals of variables in a linear regression model.

For the calibrate function, the object is a named list, where the names identify factor variables in dframe. Each element of the list is a named vector containing the population total for each level of the associated factor variable. For the postStratify function, the object is a data frame, table, or xtabs object that provides the population total for all combinations of selected factor variables in the dframe data frame. If a data frame is used for popsize, the variable containing population totals must be the last variable in the data frame. If a table is used for popsize, the table must have named dimnames where the names identify factor variables in the dframe data frame.

If the popsize argument is equal to NULL, then neither calibration nor post-stratification is performed. The default value is NULL.
Example popsize for calibration:
popsize <- list(
  Ecoregion = c(
    East = 750,
    Central = 500,
    West = 250),
  Type = c(
    Streams = 1150,
    Rivers = 350))

Example popsize for post-stratification using a data frame:
popsize <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"),
    rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75))

Example popsize for post-stratification using a table:
popsize <- with(MySurveyFrame,
  table(Ecoregion, Type))

Example popsize for post-stratification using an xtabs object:
popsize <- xtabs(~Ecoregion + Type,
  data = MySurveyFrame)
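
The calibration list and the post-stratification data frame in the examples above describe the same hypothetical population in two formats: marginal totals per factor versus totals for every combination of levels. As an illustrative sanity check (not something the function requires), the margins of the data frame version can be recomputed in base R and compared with the calibration list. The names ps_cal and ps_df are introduced here only for this sketch.

```r
# Calibration format: marginal totals for each factor
ps_cal <- list(
  Ecoregion = c(East = 750, Central = 500, West = 250),
  Type = c(Streams = 1150, Rivers = 350)
)

# Post-stratification format: totals for every Ecoregion x Type combination
ps_df <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"), rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75)
)

# Sum the joint totals over the other factor to recover each margin
eco_margin <- tapply(ps_df$Total, ps_df$Ecoregion, sum)
type_margin <- tapply(ps_df$Total, ps_df$Type, sum)

# Compare by name, since tapply() orders the levels alphabetically
all(eco_margin[names(ps_cal$Ecoregion)] == ps_cal$Ecoregion)
all(type_margin[names(ps_cal$Type)] == ps_cal$Type)
```

Both comparisons agree for these example values, which is what one would expect when the two formats describe the same population.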
vartype: Character value providing the choice of the variance estimator, where "Local" indicates the local mean estimator, "SRS" indicates the simple random sampling estimator, "HT" indicates the Horvitz-Thompson estimator, and "YG" indicates the Yates-Grundy estimator. The default value is "Local".
jointprob: Character value providing the choice of joint inclusion probability approximation for use with Horvitz-Thompson and Yates-Grundy variance estimators, where "overton" indicates the Overton approximation, "hr" indicates the Hartley-Rao approximation, and "brewer" indicates the Brewer approximation. The default value is "overton".
conf: Numeric value providing the Gaussian-based confidence level. The default value is 95.
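To make the role of conf concrete, the sketch below computes a margin of error and confidence bounds from an estimate and its standard error using a normal quantile. It assumes the margin of error is the standard error multiplied by the two-sided Gaussian quantile for the chosen level, which is consistent with the example output on this page; the input numbers are copied from the first row of that output.

```r
conf <- 95
estimate <- 24.56407   # Estimate.P, row 1 of the example output
std_error <- 3.655213  # StdError.P, row 1 of the example output

# Two-sided Gaussian quantile for the requested confidence level
z <- qnorm(1 - (1 - conf / 100) / 2)

margin <- z * std_error
lower <- estimate - margin
upper <- estimate + margin

round(c(margin = margin, lower = lower, upper = upper), 4)
```

The results can be compared with the MarginofError.P, LCB95Pct.P, and UCB95Pct.P values printed for the same row.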
All_Sites: A logical value used when subpops is not NULL. If All_Sites is TRUE, then output for all sites (ignoring subpopulations) is returned for each variable in vars alongside the subpopulation output. If All_Sites is FALSE, only the subpopulation output is returned. The default is FALSE.
The analysis results. A data frame of population estimates for all combinations of subpopulations, categories within each subpopulation, response variables, and categories within each response variable. Estimates are provided for the proportion and total of the population, along with standard error, margin of error, and confidence interval estimates. The data frame contains the following variables:
Type: subpopulation (domain) name
Subpopulation: subpopulation name within a domain
Indicator: response variable
Category: category of response variable
nResp: sample size
Estimate.P: proportion estimate (in %)
StdError.P: standard error of proportion estimate
MarginofError.P: margin of error of proportion estimate
LCBxxPct.P: xx% (default 95%) lower confidence bound of proportion estimate
UCBxxPct.P: xx% (default 95%) upper confidence bound of proportion estimate
Estimate.U: total estimate
StdError.U: standard error of total estimate
MarginofError.U: margin of error of total estimate
LCBxxPct.U: xx% (default 95%) lower confidence bound of total estimate
UCBxxPct.U: xx% (default 95%) upper confidence bound of total estimate
cont_analysis for continuous variable analysis
dframe <- data.frame(
  siteID = paste0("Site", 1:100),
  wgt = runif(100, 10, 100),
  xcoord = runif(100),
  ycoord = runif(100),
  stratum = rep(c("Stratum1", "Stratum2"), 50),
  CatVar = rep(c("north", "south", "east", "west"), 25),
  All_Sites = rep("All Sites", 100),
  Resource_Class = rep(c("Good", "Poor"), c(55, 45))
)
myvars <- c("CatVar")
mysubpops <- c("All_Sites", "Resource_Class")
mypopsize <- data.frame(
  Resource_Class = c("Good", "Poor"),
  Total = c(4000, 1500)
)
cat_analysis(dframe,
  vars = myvars, subpops = mysubpops, siteID = "siteID",
  weight = "wgt", xcoord = "xcoord", ycoord = "ycoord",
  stratumID = "stratum", popsize = mypopsize
)
#> Type Subpopulation Indicator Category nResp Estimate.P StdError.P
#> 1 All_Sites All Sites CatVar east 25 24.56407 3.655213
#> 2 All_Sites All Sites CatVar north 25 26.94039 3.655213
#> 3 All_Sites All Sites CatVar south 25 25.31730 3.360741
#> 4 All_Sites All Sites CatVar west 25 23.17824 3.360741
#> 5 All_Sites All Sites CatVar Total 100 100.00000 0.000000
#> 6 Resource_Class Good CatVar east 14 22.54320 4.807093
#> 7 Resource_Class Good CatVar north 14 28.93427 4.807093
#> 8 Resource_Class Good CatVar south 14 25.83591 4.439926
#> 9 Resource_Class Good CatVar west 13 22.68661 4.439926
#> 10 Resource_Class Good CatVar Total 55 100.00000 0.000000
#> 11 Resource_Class Poor CatVar east 11 29.95303 5.373134
#> 12 Resource_Class Poor CatVar north 11 21.62338 5.373134
#> 13 Resource_Class Poor CatVar south 11 23.93435 4.086409
#> 14 Resource_Class Poor CatVar west 12 24.48924 4.086409
#> 15 Resource_Class Poor CatVar Total 45 100.00000 0.000000
#> MarginofError.P LCB95Pct.P UCB95Pct.P Estimate.U StdError.U MarginofError.U
#> 1 7.164087 17.39998 31.72815 1351.0237 202.58725 397.0637
#> 2 7.164087 19.77630 34.10448 1481.7215 252.67329 495.2306
#> 3 6.586931 18.73037 31.90424 1392.4518 208.32033 408.3003
#> 4 6.586931 16.59131 29.76517 1274.8030 209.95998 411.5140
#> 5 0.000000 100.00000 100.00000 5500.0000 0.00000 0.0000
#> 6 9.421730 13.12148 31.96493 901.7282 179.01008 350.8533
#> 7 9.421730 19.51254 38.35600 1157.3708 245.47227 481.1168
#> 8 8.702095 17.13382 34.53801 1033.4366 194.99544 382.1840
#> 9 8.702095 13.98452 31.38871 907.4645 184.72902 362.0622
#> 10 0.000000 100.00000 100.00000 4000.0000 0.00000 0.0000
#> 11 10.531150 19.42188 40.48418 449.2955 97.12757 190.3665
#> 12 10.531150 11.09223 32.15453 324.3508 78.92257 154.6854
#> 13 8.009215 15.92513 31.94356 359.0152 69.13670 135.5054
#> 14 8.009215 16.48002 32.49845 367.3386 65.75138 128.8703
#> 15 0.000000 100.00000 100.00000 1500.0000 0.00000 0.0000
#> LCB95Pct.U UCB95Pct.U
#> 1 953.9600 1748.0874
#> 2 986.4910 1976.9521
#> 3 984.1514 1800.7521
#> 4 863.2891 1686.3170
#> 5 5500.0000 5500.0000
#> 6 550.8749 1252.5815
#> 7 676.2539 1638.4876
#> 8 651.2525 1415.6206
#> 9 545.4022 1269.5267
#> 10 4000.0000 4000.0000
#> 11 258.9289 639.6620
#> 12 169.6654 479.0362
#> 13 223.5098 494.5207
#> 14 238.4682 496.2089
#> 15 1500.0000 1500.0000
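
A few relationships in the output above can be checked by hand. The sketch below uses the four category rows for All_Sites, with values copied from the printed output, and assumes each proportion estimate is the category total divided by the sum of the category totals, expressed in percent. It also shows that the category totals add up to the population total implied by mypopsize (4000 + 1500).

```r
# Estimate.U and Estimate.P for the All_Sites category rows (rows 1 to 4)
total_est <- c(east = 1351.0237, north = 1481.7215,
               south = 1392.4518, west = 1274.8030)
prop_est <- c(east = 24.56407, north = 26.94039,
              south = 25.31730, west = 23.17824)

# Category totals should add up to the post-stratified population total
sum(total_est)

# Recompute the proportions (in %) from the totals and compare
recomputed <- 100 * total_est / sum(total_est)
max(abs(recomputed - prop_est))
```

Any remaining difference is at the level of rounding in the printed output.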