This function organizes input and output for the analysis of categorical variables. The analysis data, dframe, can be either a data frame or a simple features (sf) object. If an sf object is used, coordinates are extracted from the geometry column in the object, arguments xcoord and ycoord are assigned values "xcoord" and "ycoord", respectively, and the geometry column is dropped from the object.
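The sf handling described above can be pictured with a short sketch. It is not the package's internal code: the helper name split_sf_coords is hypothetical, and the sketch assumes that the sf package is installed and that the geometry column holds point geometries.

```r
# Hypothetical helper (not part of spsurvey): mimic the documented handling of
# an sf object by copying point coordinates into ordinary columns named
# "xcoord" and "ycoord" and then dropping the geometry column.
split_sf_coords <- function(sf_obj) {
  coords <- sf::st_coordinates(sf_obj)   # matrix with columns "X" and "Y"
  out <- sf::st_drop_geometry(sf_obj)    # plain data frame without geometry
  out$xcoord <- unname(coords[, "X"])
  out$ycoord <- unname(coords[, "Y"])
  out
}

# Run the illustration only when sf is available.
if (requireNamespace("sf", quietly = TRUE)) {
  pts <- data.frame(siteID = c("Site1", "Site2"), x = c(1, 2), y = c(3, 4))
  pts_sf <- sf::st_as_sf(pts, coords = c("x", "y"))
  flat <- split_sf_coords(pts_sf)
  print(flat)
}
```

After this step the data frame can be treated like any non-spatial input with xcoord = "xcoord" and ycoord = "ycoord".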

cat_analysis(
  dframe,
  vars,
  subpops = NULL,
  siteID = NULL,
  weight = "weight",
  xcoord = NULL,
  ycoord = NULL,
  stratumID = NULL,
  clusterID = NULL,
  weight1 = NULL,
  xcoord1 = NULL,
  ycoord1 = NULL,
  sizeweight = FALSE,
  sweight = NULL,
  sweight1 = NULL,
  fpc = NULL,
  popsize = NULL,
  vartype = "Local",
  jointprob = "overton",
  conf = 95,
  All_Sites = FALSE
)

Arguments

dframe

Data to be analyzed (analysis data). A data frame or sf object containing survey design variables, response variables, and subpopulation (domain) variables.

vars

Vector composed of character values that identify the names of response variables in dframe.

subpops

Vector composed of character values that identify the names of subpopulation (domain) variables in dframe. If a value is not provided, the value "All_Sites" is assigned to the subpops argument and a factor variable named "All_Sites" that takes the value "All Sites" is added to the dframe data frame. The default value is NULL.
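The default behavior described for subpops can be sketched in base R. This is an illustration of the documented outcome, not the function's actual source; the small dframe is made up for the example.

```r
# Sketch of the documented default: when no subpopulation variables are named,
# every site is analyzed as one subpopulation called "All Sites".
dframe <- data.frame(
  siteID = paste0("Site", 1:4),
  CatVar = c("north", "south", "east", "west")
)
subpops <- NULL

if (is.null(subpops)) {
  subpops <- "All_Sites"
  dframe$All_Sites <- factor(rep("All Sites", nrow(dframe)))
}

levels(dframe$All_Sites)
```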

siteID

Character value providing the name of the site ID variable in the dframe data frame. For a two-stage sample, the site ID variable identifies stage two site IDs. The default value is NULL, which assumes that each row in dframe represents a unique site.

weight

Character value providing the name of the design weight variable in dframe. For a two-stage sample, the weight variable identifies stage two weights. The default value is "weight".

xcoord

Character value providing the name of the x-coordinate variable in the dframe data frame. For a two-stage sample, the x-coordinate variable identifies stage two x-coordinates. Note that x-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required because the x-coordinates are taken from the geometry column in dframe. The default value is NULL.

ycoord

Character value providing the name of the y-coordinate variable in the dframe data frame. For a two-stage sample, the y-coordinate variable identifies stage two y-coordinates. Note that y-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required because the y-coordinates are taken from the geometry column in dframe. The default value is NULL.

stratumID

Character value providing the name of the stratum ID variable in the dframe data frame. The default value is NULL.

clusterID

Character value providing the name of the cluster (stage one) ID variable in dframe. Note that cluster IDs are required for a two-stage sample. The default value is NULL.

weight1

Character value providing the name of the stage one weight variable in dframe. The default value is NULL.

xcoord1

Character value providing the name of the stage one x-coordinate variable in dframe. Note that x-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.

ycoord1

Character value providing the name of the stage one y-coordinate variable in dframe. Note that y-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.

sizeweight

Logical value that indicates whether size weights should be used during estimation, where TRUE uses size weights and FALSE does not. To employ size weights for a single-stage sample, a value must be supplied for argument sweight. To employ size weights for a two-stage sample, values must be supplied for arguments sweight and sweight1. The default value is FALSE.

sweight

Character value providing the name of the size weight variable in dframe. For a two-stage sample, the size weight variable identifies stage two size weights. The default value is NULL.

sweight1

Character value providing the name of the stage one size weight variable in dframe. The default value is NULL.

fpc

Object that specifies values required for calculation of the finite population correction factor used during variance estimation. The object must match the survey design in terms of stratification and whether the design is single-stage or two-stage.

For an unstratified design, the object is a vector. For a single-stage design, the vector is composed of a single numeric value. For a two-stage design, the object is a named vector whose length is one more than the number of clusters in the sample: the first item specifies the number of clusters in the population, and each subsequent item specifies the number of stage two units for the corresponding cluster. The name of the first item is arbitrary. The remaining names identify clusters and must match the cluster IDs.

For a stratified design, the object is a named list of vectors, where names must match the stratum IDs. For each stratum, the format of the vector is identical to the format described for unstratified single-stage and two-stage designs.

Note that the finite population correction factor is not used with the local mean variance estimator.

Example fpc for a single-stage unstratified survey design:

fpc <- 15000

Example fpc for a single-stage stratified survey design:

fpc <- list(Stratum_1 = 9000, Stratum_2 = 6000)

Example fpc for a two-stage unstratified survey design:

fpc <- c(Ncluster = 150, Cluster_1 = 150, Cluster_2 = 75, Cluster_3 = 75, Cluster_4 = 125, Cluster_5 = 75)

Example fpc for a two-stage stratified survey design:

fpc <- list(
  Stratum_1 = c(Ncluster_1 = 100, Cluster_1 = 125, Cluster_2 = 100, Cluster_3 = 100, Cluster_4 = 125, Cluster_5 = 50),
  Stratum_2 = c(Ncluster_2 = 50, Cluster_1 = 75, Cluster_2 = 150, Cluster_3 = 75, Cluster_4 = 75, Cluster_5 = 125)
)
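The format rules for a two-stage unstratified fpc can be checked before calling the function. The helper below is a hypothetical convenience, not part of spsurvey, and it only tests the structure described above (length and names), not whether the values are sensible.

```r
# Hypothetical checker (not part of spsurvey) for a two-stage, unstratified
# fpc vector: the first element is the number of clusters in the population,
# and the remaining elements are named after the sampled clusters.
check_fpc_two_stage <- function(fpc, cluster_ids) {
  clusters <- unique(as.character(cluster_ids))
  is.numeric(fpc) &&
    length(fpc) == length(clusters) + 1 &&
    !is.null(names(fpc)) &&
    setequal(names(fpc)[-1], clusters)
}

fpc <- c(Ncluster = 150, Cluster_1 = 150, Cluster_2 = 75, Cluster_3 = 75,
         Cluster_4 = 125, Cluster_5 = 75)

# Made-up cluster ID column: five sampled clusters with two sites each.
sampled <- rep(paste0("Cluster_", 1:5), each = 2)

ok_full <- check_fpc_two_stage(fpc, sampled)       # every cluster is covered
ok_short <- check_fpc_two_stage(fpc[-6], sampled)  # Cluster_5 is missing
c(full = ok_full, short = ok_short)
```

For a stratified design, the same check could be applied to each list element using the cluster IDs that fall in that stratum.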

popsize

Object that provides values for the population argument of the calibrate or postStratify functions in the survey package. If a value is provided for popsize, then either the calibrate or postStratify function is used to modify the survey design object that is required by functions in the survey package. Whether to use the calibrate or postStratify function is dictated by the format of popsize, which is discussed below.

Post-stratification adjusts the sampling and replicate weights so that the joint distribution of a set of post-stratifying variables matches the known population joint distribution. Calibration, generalized raking, or GREG estimators generalize post-stratification and raking by calibrating a sample to the marginal totals of variables in a linear regression model.

For the calibrate function, the object is a named list, where the names identify factor variables in dframe. Each element of the list is a named vector containing the population total for each level of the associated factor variable.

For the postStratify function, the object is a data frame, table, or xtabs object that provides the population total for all combinations of selected factor variables in the dframe data frame. If a data frame is used for popsize, the variable containing population totals must be the last variable in the data frame. If a table is used for popsize, the table must have named dimnames where the names identify factor variables in the dframe data frame.

If the popsize argument is equal to NULL, then neither calibration nor post-stratification is performed. The default value is NULL.

Example popsize for calibration:

popsize <- list(
  Ecoregion = c(East = 750, Central = 500, West = 250),
  Type = c(Streams = 1150, Rivers = 350)
)

Example popsize for post-stratification using a data frame:

popsize <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"), rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75)
)

Example popsize for post-stratification using a table:

popsize <- with(MySurveyFrame, table(Ecoregion, Type))

Example popsize for post-stratification using an xtabs object:

popsize <- xtabs(~Ecoregion + Type, data = MySurveyFrame)
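The difference between the two popsize formats is that post-stratification needs joint totals while calibration needs only marginal totals. As an illustration in base R, the marginal totals in the calibration example can be recovered from the joint totals in the data frame example. The object names joint, joint_tab, and margins are chosen for this sketch only.

```r
# Joint population totals, copied from the post-stratification example.
joint <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"), rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75)
)

# The same joint totals arranged as an Ecoregion-by-Type xtabs object.
joint_tab <- xtabs(Total ~ Ecoregion + Type, data = joint)

# Marginal totals per variable, i.e. the named-list layout used for calibration.
margins <- list(
  Ecoregion = vapply(split(joint$Total, joint$Ecoregion), sum, numeric(1)),
  Type = vapply(split(joint$Total, joint$Type), sum, numeric(1))
)

margins$Ecoregion[c("East", "Central", "West")]
margins$Type[c("Streams", "Rivers")]
```

The marginal totals agree with the calibration example (750, 500, and 250 by ecoregion; 1150 and 350 by type), so the two examples describe the same population at different levels of detail.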

vartype

Character value providing the choice of the variance estimator, where "Local" indicates the local mean estimator, "SRS" indicates the simple random sampling estimator, "HT" indicates the Horvitz-Thompson estimator, and "YG" indicates the Yates-Grundy estimator. The default value is "Local".

jointprob

Character value providing the choice of joint inclusion probability approximation for use with Horvitz-Thompson and Yates-Grundy variance estimators, where "overton" indicates the Overton approximation, "hr" indicates the Hartley-Rao approximation, and "brewer" indicates the Brewer approximation. The default value is "overton".

conf

Numeric value providing the confidence level, as a percentage, for Gaussian-based confidence bounds. The default value is 95.
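As a rough illustration of how conf relates an estimate and its standard error to the margin of error and confidence bounds, the sketch below applies a symmetric normal-quantile interval to the first row of the output in the Examples section. It is an assumption that this mirrors the package's computation exactly; any special handling of bounds (for example, at 0 or 100 percent) is not covered.

```r
conf <- 95
est <- 24.56407   # Estimate.P, first row of the example output
se <- 3.655213    # StdError.P, same row

# Standard normal multiplier for a two-sided interval (about 1.96 for conf = 95).
mult <- qnorm(0.5 + (conf / 100) / 2)

moe <- mult * se   # margin of error
lcb <- est - moe   # lower confidence bound
ucb <- est + moe   # upper confidence bound
round(c(MarginofError = moe, LCB = lcb, UCB = ucb), 4)
```

Up to rounding of the printed inputs, these values agree with the MarginofError.P, LCB95Pct.P, and UCB95Pct.P columns for that row.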

All_Sites

Logical value that applies only when subpops is not NULL. If TRUE, output for all sites (ignoring subpopulations) is returned for each variable in vars alongside the subpopulation output. If FALSE, only the subpopulation output is returned. The default value is FALSE.

Value

The analysis results. A data frame of population estimates for all combinations of subpopulations, categories within each subpopulation, response variables, and categories within each response variable. Estimates are provided for proportion and total of the population plus standard error, margin of error, and confidence interval estimates. The data frame contains the following variables:

Type

subpopulation (domain) name

Subpopulation

subpopulation name within a domain

Indicator

response variable

Category

category of response variable

nResp

sample size

Estimate.P

proportion estimate (in %)

StdError.P

standard error of proportion estimate

MarginofError.P

margin of error of proportion estimate

LCBxxPct.P

xx% (default 95%) lower confidence bound of proportion estimate

UCBxxPct.P

xx% (default 95%) upper confidence bound of proportion estimate

Estimate.U

total estimate

StdError.U

standard error of total estimate

MarginofError.U

margin of error of total estimate

LCBxxPct.U

xx% (default 95%) lower confidence bound of total estimate

UCBxxPct.U

xx% (default 95%) upper confidence bound of total estimate
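Because the confidence bound column names embed the confidence level, code that post-processes the results may need to build those names. A minimal sketch follows, assuming an integer conf and using a small stand-in data frame whose values are copied from two rows of the output in the Examples section; the object names are illustrative.

```r
conf <- 95
lcb_name <- paste0("LCB", conf, "Pct.P")
ucb_name <- paste0("UCB", conf, "Pct.P")

# Stand-in for a results data frame (proportion columns only).
results <- data.frame(
  Type = c("All_Sites", "All_Sites"),
  Subpopulation = c("All Sites", "All Sites"),
  Indicator = c("CatVar", "CatVar"),
  Category = c("east", "Total"),
  Estimate.P = c(24.56407, 100),
  LCB95Pct.P = c(17.39998, 100),
  UCB95Pct.P = c(31.72815, 100)
)

# Drop the "Total" rows and keep each estimate with its confidence bounds.
keep <- results$Category != "Total"
subset_p <- results[keep, c("Category", "Estimate.P", lcb_name, ucb_name)]
subset_p
```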

See also

cont_analysis

for continuous variable analysis

Author

Tom Kincaid Kincaid.Tom@epa.gov

Examples

library(spsurvey)

# Simulated analysis data. No seed is set, so the estimates vary between runs.
dframe <- data.frame(
  siteID = paste0("Site", 1:100),
  wgt = runif(100, 10, 100),
  xcoord = runif(100),
  ycoord = runif(100),
  stratum = rep(c("Stratum1", "Stratum2"), 50),
  CatVar = rep(c("north", "south", "east", "west"), 25),
  All_Sites = rep("All Sites", 100),
  Resource_Class = rep(c("Good", "Poor"), c(55, 45))
)
myvars <- c("CatVar")
mysubpops <- c("All_Sites", "Resource_Class")
mypopsize <- data.frame(
  Resource_Class = c("Good", "Poor"),
  Total = c(4000, 1500)
)
cat_analysis(dframe,
  vars = myvars, subpops = mysubpops, siteID = "siteID",
  weight = "wgt", xcoord = "xcoord", ycoord = "ycoord",
  stratumID = "stratum", popsize = mypopsize
)
#>              Type Subpopulation Indicator Category nResp Estimate.P StdError.P
#> 1       All_Sites     All Sites    CatVar     east    25   24.56407   3.655213
#> 2       All_Sites     All Sites    CatVar    north    25   26.94039   3.655213
#> 3       All_Sites     All Sites    CatVar    south    25   25.31730   3.360741
#> 4       All_Sites     All Sites    CatVar     west    25   23.17824   3.360741
#> 5       All_Sites     All Sites    CatVar    Total   100  100.00000   0.000000
#> 6  Resource_Class          Good    CatVar     east    14   22.54320   4.807093
#> 7  Resource_Class          Good    CatVar    north    14   28.93427   4.807093
#> 8  Resource_Class          Good    CatVar    south    14   25.83591   4.439926
#> 9  Resource_Class          Good    CatVar     west    13   22.68661   4.439926
#> 10 Resource_Class          Good    CatVar    Total    55  100.00000   0.000000
#> 11 Resource_Class          Poor    CatVar     east    11   29.95303   5.373134
#> 12 Resource_Class          Poor    CatVar    north    11   21.62338   5.373134
#> 13 Resource_Class          Poor    CatVar    south    11   23.93435   4.086409
#> 14 Resource_Class          Poor    CatVar     west    12   24.48924   4.086409
#> 15 Resource_Class          Poor    CatVar    Total    45  100.00000   0.000000
#>    MarginofError.P LCB95Pct.P UCB95Pct.P Estimate.U StdError.U MarginofError.U
#> 1         7.164087   17.39998   31.72815  1351.0237  202.58725        397.0637
#> 2         7.164087   19.77630   34.10448  1481.7215  252.67329        495.2306
#> 3         6.586931   18.73037   31.90424  1392.4518  208.32033        408.3003
#> 4         6.586931   16.59131   29.76517  1274.8030  209.95998        411.5140
#> 5         0.000000  100.00000  100.00000  5500.0000    0.00000          0.0000
#> 6         9.421730   13.12148   31.96493   901.7282  179.01008        350.8533
#> 7         9.421730   19.51254   38.35600  1157.3708  245.47227        481.1168
#> 8         8.702095   17.13382   34.53801  1033.4366  194.99544        382.1840
#> 9         8.702095   13.98452   31.38871   907.4645  184.72902        362.0622
#> 10        0.000000  100.00000  100.00000  4000.0000    0.00000          0.0000
#> 11       10.531150   19.42188   40.48418   449.2955   97.12757        190.3665
#> 12       10.531150   11.09223   32.15453   324.3508   78.92257        154.6854
#> 13        8.009215   15.92513   31.94356   359.0152   69.13670        135.5054
#> 14        8.009215   16.48002   32.49845   367.3386   65.75138        128.8703
#> 15        0.000000  100.00000  100.00000  1500.0000    0.00000          0.0000
#>    LCB95Pct.U UCB95Pct.U
#> 1    953.9600  1748.0874
#> 2    986.4910  1976.9521
#> 3    984.1514  1800.7521
#> 4    863.2891  1686.3170
#> 5   5500.0000  5500.0000
#> 6    550.8749  1252.5815
#> 7    676.2539  1638.4876
#> 8    651.2525  1415.6206
#> 9    545.4022  1269.5267
#> 10  4000.0000  4000.0000
#> 11   258.9289   639.6620
#> 12   169.6654   479.0362
#> 13   223.5098   494.5207
#> 14   238.4682   496.2089
#> 15  1500.0000  1500.0000
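One way to sanity-check post-stratified output such as the above is to confirm that the category totals within each Resource_Class subpopulation add up to the population totals supplied in mypopsize. The sketch below does this in base R with the Estimate.U values copied from the printed output; the object names est_u and sums are illustrative.

```r
# Estimate.U values for the Resource_Class rows above, excluding "Total" rows.
est_u <- data.frame(
  Subpopulation = rep(c("Good", "Poor"), each = 4),
  Estimate.U = c(901.7282, 1157.3708, 1033.4366, 907.4645,
                 449.2955, 324.3508, 359.0152, 367.3386)
)
mypopsize <- data.frame(
  Resource_Class = c("Good", "Poor"),
  Total = c(4000, 1500)
)

# Sum the category totals within each subpopulation and compare with popsize,
# allowing for rounding in the printed values.
sums <- tapply(est_u$Estimate.U, est_u$Subpopulation, sum)
totals_match <- all.equal(
  as.numeric(sums[mypopsize$Resource_Class]),
  mypopsize$Total,
  tolerance = 1e-4
)
totals_match
```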