Trend analysis

This function organizes input and output for estimation of trend across time for a series of samples (for categorical and continuous variables). Trend is estimated using the analytical procedure identified by the model arguments. For categorical variables, the choices for the model_cat argument are: (1) simple linear regression, (2) weighted linear regression, and (3) generalized linear mixed-effects model. For continuous variables, the choices for the model_cont argument are: (1) simple linear regression, (2) weighted linear regression, and (3) linear mixed-effects model. The analysis data, dframe, can be either a data frame or a simple features (sf) object. If an sf object is used, coordinates are extracted from the geometry column in the object, arguments xcoord and ycoord are assigned values "xcoord" and "ycoord", respectively, and the geometry column is dropped from the object.

trend_analysis(
  dframe,
  vars_cat = NULL,
  vars_cont = NULL,
  subpops = NULL,
  model_cat = "SLR",
  cat_rhs = NULL,
  model_cont = "LMM",
  cont_rhs = NULL,
  siteID = "siteID",
  yearID = "year",
  weight = "weight",
  xcoord = NULL,
  ycoord = NULL,
  stratumID = NULL,
  clusterID = NULL,
  weight1 = NULL,
  xcoord1 = NULL,
  ycoord1 = NULL,
  sizeweight = FALSE,
  sweight = NULL,
  sweight1 = NULL,
  fpc = NULL,
  popsize = NULL,
  invprboot = TRUE,
  nboot = 1000,
  vartype = "Local",
  jointprob = "overton",
  conf = 95,
  All_Sites = FALSE
)

Arguments

dframe

Data to be analyzed (analysis data). A data frame or sf object containing survey design variables, response variables, and subpopulation (domain) variables.

vars_cat

Vector composed of character values that identify the names of categorical response variables in dframe. If argument model_cat equals "GLMM", the categorical variables in the dframe data frame must be factors each of which has two levels, where the second level will be assumed to specify "success". The default value is NULL.

vars_cont

Vector composed of character values that identify the names of continuous response variables in dframe. The default value is NULL.

subpops

Vector composed of character values that identify the names of subpopulation (domain) variables in dframe. If a value is not provided, the value "All_Sites" is assigned to the subpops argument and a factor variable named "All_Sites" that takes the value "All Sites" is added to dframe. The default value is NULL.

model_cat

Character value identifying the analytical procedure used for trend estimation for categorical variables. The choices are: "SLR" (simple linear regression), "WLR" (weighted linear regression), and "GLMM" (generalized linear mixed-effects model). The default value is "SLR".

cat_rhs

Character value specifying the right hand side of the formula for a generalized linear mixed-effects model. If a value is not provided, the argument is assigned a value that specifies the Piepho and Ogutu (2002) model. The default value is NULL.

model_cont

Character value identifying the analytical procedure used for trend estimation for continuous variables. The choices are: "SLR" (simple linear regression), "WLR" (weighted linear regression), and "LMM" (linear mixed-effects model). The default value is "LMM".

cont_rhs

Character value specifying the right hand side of the formula for a linear mixed-effects model. If a value is not provided, the argument is assigned a value that specifies the Piepho and Ogutu (2002) model. The default value is NULL.

siteID

Character value providing name of the site ID variable in dframe. If repeat visit sites are present, the site ID value for each revisit site will be the same for each survey. For a two-stage sample, the site ID variable identifies stage two site IDs. The default value is "siteID".

yearID

Character value providing name of the time period variable in dframe, which must be numeric and will be forced to numeric if it is not. The default assumption is that the time period variable is years. The default value is "year".

weight

Character value providing name of the design weight variable in dframe. For a two-stage sample, the weight variable identifies stage two weights. The default value is "weight".

xcoord

Character value providing name of the x-coordinate variable in dframe. For a two-stage sample, the x-coordinate variable identifies stage two x-coordinates. Note that x-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required (as the geometry column in dframe is used to find the x-coordinate). The default value is NULL.

ycoord

Character value providing name of the y-coordinate variable in dframe. For a two-stage sample, the y-coordinate variable identifies stage two y-coordinates. Note that y-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required (as the geometry column in dframe is used to find the y-coordinate). The default value is NULL.

stratumID

Character value providing name of the stratum ID variable in dframe. The default value is NULL.

clusterID

Character value providing name of the cluster (stage one) ID variable in dframe. Note that cluster IDs are required for a two-stage sample. The default value is NULL.

weight1

Character value providing name of the stage one weight variable in dframe. The default value is NULL.

xcoord1

Character value providing name of the stage one x-coordinate variable in dframe. Note that x-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.

ycoord1

Character value providing name of the stage one y-coordinate variable in dframe. Note that y-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.

sizeweight

Logical value that indicates whether size weights should be used during estimation, where TRUE = use size weights and FALSE = do not use size weights. To employ size weights for a single-stage sample, a value must be supplied for argument weight. To employ size weights for a two-stage sample, values must be supplied for arguments weight and weight1. The default value is FALSE.

sweight

Character value providing name of the size weight variable in dframe. For a two-stage sample, the size weight variable identifies stage two size weights. The default value is NULL.

sweight1

Character value providing name of the stage one size weight variable in dframe. The default value is NULL.

fpc

Object that specifies values required for calculation of the finite population correction factor used during variance estimation. The object must match the survey design in terms of stratification and whether the design is single-stage or two-stage. For an unstratified design, the object is a vector. The vector is composed of a single numeric value for a single-stage design. For a two-stage unstratified design, the object is a named vector containing one more than the number of clusters in the sample, where the first item in the vector specifies the number of clusters in the population and each subsequent item specifies the number of stage two units for the cluster. The name for the first item in the vector is arbitrary. Subsequent names in the vector identify clusters and must match the cluster IDs. For a stratified design, the object is a named list of vectors, where names must match the strata IDs. For each stratum, the format of the vector is identical to the format described for unstratified single-stage and two-stage designs. Note that the finite population correction factor is not used with the local mean variance estimator.

Example fpc for a single-stage unstratified survey design:

fpc <- 15000

Example fpc for a single-stage stratified survey design:

fpc <- list( Stratum_1 = 9000, Stratum_2 = 6000)

Example fpc for a two-stage unstratified survey design:

fpc <- c( Ncluster = 150, Cluster_1 = 150, Cluster_2 = 75, Cluster_3 = 75, Cluster_4 = 125, Cluster_5 = 75)

Example fpc for a two-stage stratified survey design:

fpc <- list( Stratum_1 = c( Ncluster_1 = 100, Cluster_1 = 125, Cluster_2 = 100, Cluster_3 = 100, Cluster_4 = 125, Cluster_5 = 50), Stratum_2 = c( Ncluster_2 = 50, Cluster_1 = 75, Cluster_2 = 150, Cluster_3 = 75, Cluster_4 = 75, Cluster_5 = 125))

popsize

Object that provides values for the population argument of the calibrate or postStratify functions in the survey package. If a value is provided for popsize, then either the calibrate or postStratify function is used to modify the survey design object that is required by functions in the survey package. Whether to use the calibrate or postStratify function is dictated by the format of popsize, which is discussed below. Post-stratification adjusts the sampling and replicate weights so that the joint distribution of a set of post-stratifying variables matches the known population joint distribution. Calibration, generalized raking, or GREG estimators generalize post-stratification and raking by calibrating a sample to the marginal totals of variables in a linear regression model. For the calibrate function, the object is a named list, where the names identify factor variables in dframe. Each element of the list is a named vector containing the population total for each level of the associated factor variable. For the postStratify function, the object is either a data frame, table, or xtabs object that provides the population total for all combinations of selected factor variables in the dframe data frame. If a data frame is used for popsize, the variable containing population totals must be the last variable in the data frame. If a table is used for popsize, the table must have named dimnames where the names identify factor variables in the dframe data frame. If the popsize argument is equal to NULL, then neither calibration nor post-stratification is performed. The default value is NULL.

Example popsize for calibration:

popsize <- list( Ecoregion = c( East = 750, Central = 500, West = 250), Type = c( Streams = 1150, Rivers = 350))

Example popsize for post-stratification using a data frame:

popsize <- data.frame( Ecoregion = rep(c("East", "Central", "West"), rep(2, 3)), Type = rep(c("Streams", "Rivers"), 3), Total = c(575, 175, 400, 100, 175, 75))

Example popsize for post-stratification using a table:

popsize <- with(MySurveyFrame, table(Ecoregion, Type))

Example popsize for post-stratification using an xtabs object:

popsize <- xtabs(~Ecoregion + Type, data = MySurveyFrame)

invprboot

Logical value that indicates whether the inverse probability bootstrap procedure is used to calculate trend parameter estimates. This bootstrap procedure is only available for the "LMM" option for continuous variables. Inverse probability references the design weights, which are the inverse of the sample inclusion probabilities. The default value is TRUE.

nboot

Numeric value for the number of bootstrap iterations. The default is 1000.

vartype

Character value providing choice of the variance estimator, where "Local" = the local mean estimator, "SRS" = the simple random sampling estimator, "HT" = the Horvitz-Thompson estimator, and "YG" = the Yates-Grundy estimator. The default value is "Local".

jointprob

Character value providing choice of joint inclusion probability approximation for use with Horvitz-Thompson and Yates-Grundy variance estimators, where "overton" indicates the Overton approximation, "hr" indicates the Hartley_Rao approximation, and "brewer" equals the Brewer approximation. The default value is "overton".

conf

Numeric value for the Gaussian-based confidence level. The default is 95.

All_Sites

A logical variable used when subpops is not NULL. If All_Sites is TRUE, then alongside the subpopulation output, output for all sites (ignoring subpopulations) is returned for each variable in vars. If All_Sites is FALSE, then alongside the subpopulation output, output for all sites (ignoring subpopulations) is not returned for each variable in vars. The default is FALSE.

Value

The analysis results. A list composed of two data frames containing trend estimates for all combinations of population Types, subpopulations within Types, and response variables. For categorical variables, trend estimates are calculated for each category of the variable. The two data frames in the output list are:

catsum: data frame containing trend estimates for categorical variables
contsum: data frame containing trend estimates for continuous variables

For the SLR and WLR model options, the data frame contains the following variables:

Type: subpopulation (domain) name
Subpopulation: subpopulation name within a domain
Indicator: response variable
Trend_Estimate: trend estimate
Trend_Std_Error: trend standard error
Trend_LCBxxPct: trend xx% (default 95%) lower confidence bound
Trend_UCBxxPct: trend xx% (default 95%) upper confidence bound
Trend_p_Value: trend p-value
Intercept_Estimate: intercept estimate
Intercept_Std_Error: intercept standard error
Intercept_LCBxxPct: intercept xx% (default 95%) lower confidence bound
Intercept_UCBxxPct: intercept xx% (default 95%) upper confidence bound
Intercept_p_Value: intercept p-value
R_Squared: R-squared value
Adj_R_Squared: adjusted R-squared value

For the GLMM and LMM model options, contents of the data frames will vary depending on the model specified by arguments cat_rhs and

cont_rhs. For the default PO model, the data frame contains the following variables:

Type: subpopulation (domain) name
Subpopulation: subpopulation name within a domain
Indicator: response variable
Trend_Estimate: trend estimate
Trend_Std_Error: trend standard error
Trend_LCBxxPct: trend xx% (default 95%) lower confidence bound
Trend_UCBxxPct: trend xx% (default 95%) upper confidence bound
Trend_p_Value: trend p-value
Intercept_Estimate: intercept estimate
Intercept_Std_Error: intercept standard error
Intercept_LCBxxPct: intercept xx% (default 95%) lower confidence bound
Intercept_UCBxxPct: intercept xx% (default 95%) upper confidence bound
Intercept_p_Value: intercept p-value
Var_SiteInt: variance of the site intercepts
Var_SiteTrend: variance of the site trends
Corr_SiteIntSlope: correlation of site intercepts and site trends
Var_Year: year variance
Var_Residual: residual variance
AIC: generalized Akaike Information Criterion

Details

For the simple linear regression (SLR) model, a design-based estimate of the category proportion (categorical variables) or the mean (continuous variables) is calculated for each time period (year). Four choices of variance estimator are available for calculating variance of the design-based estimates: (1) the local mean estimator, (2) the simple random sampling estimator, (3) the Horvitz-Thompson estimator, and (4) the Yates-Grundy estimator. For the Horvitz-Thompson and Yates-Grundy estimators, there are three choices for calculating joint inclusion probabilities: (1) the Overton approximation, (2) the Hartley-Rao approximation, and (3) the Brewer approximation. The lm function in the stats package is used to fit a linear model using a formula argument that specifies the proportion or mean estimates as the response variable and years as the regressor variable. For fitting the SLR model, the yearID variable from the dframe argument is modified by subtracting the minimum value of years from all values of the variable. Parameter estimates are extracted from the object returned by the lm function. For the weighted linear regression (WLR) model, the process is the same as the SLR model except that the inverse of the variances of the proportion or mean estimates is used as the weights argument in the call to the lm function. For the LMM option, the lmer function in the lme4 package is used to fit a linear mixed-effects model for trend across years. For both the GLMM and LMM options, the default Piepho and Ogutu (PO) model includes fixed effects for intercept and trend (slope) and random effects for intercept and trend for individual sites, where the siteID variable from the dframe argument identifies sites. Correlation between the random effects for site intercepts and site trends is included in the model. Finally, the PO model contains random effects for year variance and residual variance. For the GLMM and LMM options, arguments cat_rhs and cont_rhs, respectively, can be used to specify the right hand side of the model formula. Internally, a variable named Wyear is created that is useful for specifying the cat_rhs and cont_rhs arguments. The Wyear variable is created by subtracting the minimum value of the yearID variable from all values of the variable. If argument invprboot is FALSE, parameter estimates are extracted from the object returned by the lmer function. If argument invprboot is TRUE, the boot function in the boot package is used to generate bootstrap replicates using a function named bootfcn as the statistic argument passed to the boot function. For each bootstrap replicate, bootfcn calls the glmer or lmer function, as appropriate, using the specified model. design weights identified by the weight argument for the trend_analysis function are passed as the weights argument for the boot function, which specifies importance weights. Using the design weights as the weights argument ensures that bootstrap replicates are representative of the survey population. Parameter estimates are calculated using the object returned by the boot function.

Author

Tom Kincaid Kincaid.Tom@epa.gov

Examples

# Example using a categorical variable with three resource classes and a
# continuous variable
mydframe <- data.frame(
  siteID = rep(paste0("Site", 1:40), rep(5, 40)),
  yearID = rep(seq(2000, 2020, by = 5), 40),
  wgt = rep(runif(40, 10, 100), rep(5, 40)),
  xcoord = rep(runif(40), rep(5, 40)),
  ycoord = rep(runif(40), rep(5, 40)),
  All_Sites = rep("All Sites", 200),
  Region = sample(c("North", "South"), 200, replace = TRUE),
  Resource_Class = sample(c("Good", "Fair", "Poor"), 200, replace = TRUE),
  ContVar = rnorm(200, 10, 1)
)
myvars_cat <- c("Resource_Class")
myvars_cont <- c("ContVar")
mysubpops <- c("All_Sites", "Region")
trend_analysis(
  dframe = mydframe,
  vars_cat = myvars_cat,
  vars_cont = myvars_cont,
  subpops = mysubpops,
  model_cat = "WLR",
  model_cont = "SLR",
  siteID = "siteID",
  yearID = "yearID",
  weight = "wgt",
  xcoord = "xcoord",
  ycoord = "ycoord"
)
#> $catsum
#>        Type Subpopulation      Indicator Category Trend_Estimate
#> 1 All_Sites     All Sites Resource_Class     Fair     0.04523685
#> 2 All_Sites     All Sites Resource_Class     Good    -0.40839357
#> 3 All_Sites     All Sites Resource_Class     Poor     0.38870400
#> 4    Region         North Resource_Class     Fair     0.61223475
#> 5    Region         North Resource_Class     Good    -1.22616241
#> 6    Region         North Resource_Class     Poor     0.66950061
#> 7    Region         South Resource_Class     Fair    -1.41616234
#> 8    Region         South Resource_Class     Good     0.68421557
#> 9    Region         South Resource_Class     Poor    -0.10773624
#>   Trend_Std_Error Trend_LCB95Pct Trend_UCB95Pct Trend_p_Value
#> 1       0.3660183     -1.1195967      1.2100704     0.9094540
#> 2       0.3384541     -1.4855054      0.6687183     0.3140403
#> 3       0.3990614     -0.8812874      1.6586954     0.4018741
#> 4       0.6544452     -1.4705021      2.6949716     0.4185408
#> 5       0.9024478     -4.0981542      1.6458293     0.2673743
#> 6       0.3746057     -0.5226620      1.8616632     0.1718685
#> 7       0.9588723     -4.4677220      1.6353973     0.2362015
#> 8       0.4157430     -0.6388643      2.0072955     0.1983640
#> 9       0.8545475     -2.8272879      2.6118154     0.9076481
#>   Intercept_Estimate Intercept_Std_Error Intercept_LCB95Pct Intercept_UCB95Pct
#> 1           34.17562            4.458670         19.9861447           48.36510
#> 2           37.36444            4.169962         24.0937551           50.63512
#> 3           27.66571            4.923735         11.9961868           43.33523
#> 4           29.25989            7.928626          4.0274576           54.49231
#> 5           50.01256           10.936958         15.2062788           84.81884
#> 6           18.52645            4.481118          4.2655363           32.78737
#> 7           43.46842           13.852895         -0.6176701           87.55452
#> 8           21.01016            4.830891          5.6361042           36.38421
#> 9           38.66653           10.060016          6.6510671           70.68199
#>   Intercept_p_Value   R_Squared Adj_R_Squared
#> 1       0.004612606 0.005065849   -0.32657887
#> 2       0.002933281 0.326748926    0.10233190
#> 3       0.011145479 0.240268774   -0.01297497
#> 4       0.034506712 0.225839176   -0.03221443
#> 5       0.019623920 0.380943639    0.17459152
#> 6       0.025679670 0.515670797    0.35422773
#> 7       0.051749281 0.420988626    0.22798484
#> 8       0.022450110 0.474471974    0.29929597
#> 9       0.031074264 0.005270296   -0.32630627
#> 
#> $contsum
#>        Type Subpopulation Indicator Trend_Estimate Trend_Std_Error
#> 1 All_Sites     All Sites   ContVar   -0.004842711      0.01030439
#> 2    Region         North   ContVar   -0.013031361      0.02167446
#> 3    Region         South   ContVar    0.015256704      0.01759278
#>   Trend_LCB95Pct Trend_UCB95Pct Trend_p_Value Intercept_Estimate
#> 1    -0.03763588     0.02795046     0.6704321          10.118495
#> 2    -0.08200918     0.05594645     0.5900801          10.279755
#> 3    -0.04073137     0.07124478     0.4496260           9.849746
#>   Intercept_Std_Error Intercept_LCB95Pct Intercept_UCB95Pct Intercept_p_Value
#> 1           0.1262025           9.716862           10.52013      4.276453e-06
#> 2           0.2654569           9.434953           11.12456      3.788461e-05
#> 3           0.2154667           9.164035           10.53546      2.304561e-05
#>    R_Squared Adj_R_Squared
#> 1 0.06857402   -0.24190130
#> 2 0.10753566   -0.18995246
#> 3 0.20043923   -0.06608102
#>

Arguments

Value

Details

See also

Author

Examples