Access and harmonize macroinvertebrate data — getInvertData • finsyncR

This function generates an occurrence or abundance community matrix for benthic macroinvertebrates sampled in rivers and streams from the US EPA National Rivers and Streams Assessment and USGS BioData.

Usage

getInvertData(
  dataType = "occur",
  taxonLevel = "Genus",
  taxonFix = "lump",
  agency = c("USGS", "EPA"),
  lifestage = FALSE,
  rarefy = TRUE,
  rarefyCount = 300,
  sharedTaxa = FALSE,
  seed = 0,
  boatableStreams = FALSE
)

Arguments

dataType: Output data type for the community matrix, either "density" (density) or "occur" (occurrence).
taxonLevel: Level of taxonomic resolution for the community matrix. Input must be one of: "Phylum", "Class", "Order", "Family", "Subfamily", "Genus", or "Mixed".
taxonFix: Option to account for changes in taxonomy across time, must be one of: "none", "lump", "remove". See Details below for more information.
agency: The agency name or names (e.g., "USGS" and "EPA") that are the source of data for the output community matrix. See Details below for more information.
lifestage: logical. For USGS data only, should the output dataset include lifestage information for each taxon? TRUE or FALSE. Default is FALSE.
rarefy: logical. Should samples be standardized by the number of individuals identified within each sample? TRUE or FALSE. See Details below for more information.
rarefyCount: integer. If rarefy = TRUE, the individual count to be used as the cutoff for rarefaction (standardizing samples by the number of individuals identified). Default is 300.
sharedTaxa: logical. Should genera be limited to those that appear in both the EPA and USGS datasets? TRUE or FALSE. Must be set to FALSE when only one agency is specified. Default is FALSE.
seed: numeric. Set seed for rarefy to get consistent results with each new iteration of the function. Value gets passed to set.seed internally.
boatableStreams: logical. Should EPA boatable streams be included in the output dataset? TRUE or FALSE. Note: all USGS streams are wadeable. It is not advisable to include boatable streams when building a dataset that combines EPA and USGS data, because boatable EPA data and wadeable USGS data are not considered comparable.
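The argument constraints listed above can be checked before a call is made. The following is a minimal sketch in base R, not part of finsyncR: check_invert_args() is a hypothetical helper, and the messages it returns are illustrative only.

```r
# Hypothetical helper (not part of finsyncR): collects problems with argument
# combinations described in this documentation before calling getInvertData().
check_invert_args <- function(taxonLevel = "Genus",
                              taxonFix = "lump",
                              agency = c("USGS", "EPA"),
                              sharedTaxa = FALSE,
                              boatableStreams = FALSE) {
  levels_ok <- c("Phylum", "Class", "Order", "Family",
                 "Subfamily", "Genus", "Mixed")
  problems <- character(0)

  # taxonLevel is case sensitive and must match one of the listed values.
  if (!taxonLevel %in% levels_ok) {
    problems <- c(problems, "taxonLevel must be one of the documented levels")
  }
  if (!taxonFix %in% c("none", "lump", "remove")) {
    problems <- c(problems, "taxonFix must be 'none', 'lump', or 'remove'")
  }
  # sharedTaxa requires both agencies to be requested.
  if (sharedTaxa && length(unique(agency)) < 2) {
    problems <- c(problems, "sharedTaxa must be FALSE with a single agency")
  }
  # Boatable EPA data and wadeable USGS data are not considered comparable.
  if (boatableStreams && all(c("USGS", "EPA") %in% agency)) {
    problems <- c(problems,
                  "boatableStreams = TRUE is not advised with both agencies")
  }
  problems
}

# Returns one message: sharedTaxa cannot be TRUE with a single agency.
check_invert_args(agency = "EPA", sharedTaxa = TRUE)
```

An empty character vector means none of the checked constraints were violated.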

Value

A taxa by sample data frame with site, stream reach, and sample information.

Details

agency refers to the agency that collected the invertebrate samples. If you want to use data from both agencies, set agency = c("USGS", "EPA"), which is the default. For USGS data, sampling data include all USGS BioData with SampleMethodCode of "BERW", "IRTH", "SWAMP", "EMAP", "CDPHE RR", and "PNAMP". Note that by default, only moving waters classified as "wadeable" are included, but setting boatableStreams = TRUE will include observations from non-wadeable streams. Some information included in the EPA dataset is not included in the USGS datasets, specifically the observed wetted width of the stream/river.

taxonLevel refers to the taxonomic resolution (Genus, Class, Family, etc.) for the sample by taxa matrix. The input values for this parameter are case sensitive and must start with a capital letter. All observations taxonomically coarser than the taxonLevel provided are dropped from the output community matrix. For instance, if taxonLevel = "Genus", then observations identified at Subfamily, Family, Order, Class, or Phylum levels are dropped. When taxonLevel = "Subfamily", for taxa without subfamilies, the Family-level designation is returned. "Genus" is the finest level of taxonomic resolution provided for macroinvertebrates. The function also provides the option of returning the lowest level of taxonomic identification for all specimens (taxonLevel = "Mixed").
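The dropping rule for coarser observations can be illustrated with a toy table. This is a sketch in base R under assumed names: the data frame obs, its columns, and keep_level() are hypothetical and do not reflect the package's internal data structures. The Subfamily fallback to Family is not covered here.

```r
# Toy observations; NA marks ranks finer than the level actually identified.
obs <- data.frame(
  SampleID = c("S1", "S1", "S1", "S2", "S2"),
  Family   = c("Baetidae", "Baetidae", "Chironomidae", "Heptageniidae", NA),
  Genus    = c("Baetis", NA, NA, "Epeorus", NA),
  Count    = c(12, 3, 40, 7, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows identified at least as finely as
# the requested level (rows with NA at that level are coarser and dropped).
keep_level <- function(data, level) {
  data[!is.na(data[[level]]), , drop = FALSE]
}

genus_obs  <- keep_level(obs, "Genus")   # rows identified to genus
family_obs <- keep_level(obs, "Family")  # rows identified to family or finer
```

In this toy table, two of the five rows survive at the Genus level and four survive at the Family level.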

taxonFix provides options to account for changes in taxonomy across time, especially in instances in which species have been reorganized into new genera. taxonFix operates on the genus level. taxonFix = "none" makes no adjustment. taxonFix = "lump" prioritizes retaining observations by giving a unified "slash" genus name to all species and genera that have been linked through changes in taxonomy through time (e.g. genera1/genera2/genera3). Note: of 82 problematic genera that exist throughout both datasets, taxonFix = "lump" results in 11 "lumped" genera. All but two complexes of genera were composed of two individual genera. A single complex of genera within the Ephemeroptera order included 70 individual genera, and a second complex of Ephemeroptera genera included six individual genera. Because of the complexity of Ephemeroptera taxonomy, careful consideration should be given to inferences that can be made when evaluating Ephemeroptera trends. The authors, for instance, do not advise that users evaluate temporal changes in abundance or richness of the linked groups of Ephemeroptera genera because taxonomic reorganizations likely obscure temporal patterns.

Alternatively, genera linked by taxonomic reorganization can be removed with taxonFix = "remove". This option prioritizes accurate identification by dropping observations that cannot be confidently identified to a single genus, as in the complexes of genera previously described. Without a species-level identification, there is no way to assure correct membership in an updated genus. Organisms with a species-level identification are cross-walked to an updated genus. NOTE on "slash" genera: When taxonFix = "lump", these "slash" genera are rolled into the larger linked genera, as above. taxonFix = "remove" prioritizes accurate identifications by dropping all "slash" genera, including those organisms identified as a "slash" genus in the lab; this option will result in many fewer genera in the final dataset. Finally, taxonFix = "none" includes "slash" genera, but it does not connect these genera to larger linked genera.
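The difference between the three taxonFix options can be sketched on a toy table. This is a simplified illustration in base R, not the package's implementation: the lookup linked, the genus names, and fix_taxa() are all hypothetical, and the species-level cross-walk described above is ignored.

```r
# Hypothetical lookup: genera linked by taxonomic reorganization share a
# single "slash" name.
linked <- c(GenusA = "GenusA/GenusB", GenusB = "GenusA/GenusB")

obs <- data.frame(
  Genus = c("GenusA", "GenusB", "GenusC"),
  Count = c(5, 2, 9),
  stringsAsFactors = FALSE
)

fix_taxa <- function(data, method = c("none", "lump", "remove")) {
  method <- match.arg(method)
  is_linked <- data$Genus %in% names(linked)
  if (method == "lump") {
    # Rename linked genera to their shared slash name, then sum counts.
    data$Genus[is_linked] <- unname(linked[data$Genus[is_linked]])
    data <- aggregate(Count ~ Genus, data = data, FUN = sum)
  } else if (method == "remove") {
    # Drop every observation that belongs to a linked group.
    data <- data[!is_linked, , drop = FALSE]
  }
  data
}

lumped  <- fix_taxa(obs, "lump")    # GenusA/GenusB and GenusC
removed <- fix_taxa(obs, "remove")  # GenusC only
```

"lump" keeps every individual but merges identities, while "remove" keeps identities unambiguous at the cost of observations.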

taxonFix operates on the genus level, so set taxonFix = "none" when taxonLevel is set to "Family" or a coarser taxonomic level. When long time scale datasets on the entire community of macroinvertebrates are generated, care should be taken to harmonize taxonomy, either with the approaches provided or with some alternative, because changes in taxonomy can make it artificially appear as though some genera are appearing or disappearing through time. See vignette("GettingStarted") for more information.

If rarefy = TRUE, only samples with at least rarefyCount individuals identified (raw count) are retained, so samples with fewer than rarefyCount individuals are removed. The default rarefaction threshold is 300 organisms because 1) with every 50 individuals identified, ~1 genus is added to the sample and 2) 91.3% of samples have at least 300 individuals identified. Lowering the threshold to 200 individuals removes ~2 genera per sample, but only an additional 3.2% of samples are included (94.5% of samples). Raising the threshold to 400 individuals adds ~2 genera per sample, but reduces the retained samples to 70.6% of all samples. Use seed = ... to get consistent output of community data. See vignette("GettingStarted") for more information regarding rarefaction. NOTE: rarefy = TRUE can be used when a user wants occurrence data (presence/absence) OR proportional data (each taxon represents a certain proportion of a sample). Use rarefy = FALSE when densities are the measure of interest.
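The rarefaction step can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: it assumes individuals are drawn without replacement, and the data frame counts and the function rarefy_samples() are hypothetical.

```r
# Toy long-format counts: one row per sample-by-genus combination.
counts <- data.frame(
  SampleID = c("S1", "S1", "S1", "S2", "S2"),
  Genus    = c("Baetis", "Epeorus", "Hydropsyche", "Baetis", "Epeorus"),
  Count    = c(200, 150, 50, 120, 80),
  stringsAsFactors = FALSE
)

rarefy_samples <- function(data, rarefyCount = 300, seed = 0) {
  set.seed(seed)  # fixed seed so repeated calls give the same draw
  by_sample <- split(data, data$SampleID)
  out <- lapply(by_sample, function(s) {
    # Samples with fewer than rarefyCount individuals are dropped.
    if (sum(s$Count) < rarefyCount) return(NULL)
    # Expand to one entry per individual, then draw without replacement.
    individuals <- rep(s$Genus, times = s$Count)
    drawn <- sample(individuals, size = rarefyCount, replace = FALSE)
    tab <- table(drawn)
    data.frame(SampleID = s$SampleID[1],
               Genus = names(tab),
               Count = as.integer(tab),
               stringsAsFactors = FALSE)
  })
  # Discard dropped samples before binding the remaining ones together.
  do.call(rbind, Filter(Negate(is.null), out))
}

# S1 (400 individuals) is subsampled to 300; S2 (200 individuals) is dropped.
rarefied <- rarefy_samples(counts, rarefyCount = 300, seed = 10)
```

Because the seed is set inside the function, calling it twice with the same seed returns identical counts, which mirrors the purpose of the seed argument.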

When dataType = "density", the function calculates taxa densities from samples using lab subsampling ratios and area sampled:
$$Taxa~abundance = n \times \frac{1}{PropID}$$
$$Taxa~density = \frac{Taxa~abundance}{Area~sampled~(m^2)}$$
where n is the number of specimens identified and PropID is the proportion of the sample that was identified at the lab bench. For the USGS dataset, this incorporates both the "field split ratio" (proportion of the sample that was brought into the lab for specimen identification) and the "lab subsampling ratio" (proportion of grids used to identify invertebrates at the lab bench). For the EPA datasets, this is just the "lab subsampling ratio", the proportion of grids used to identify invertebrates at the lab bench. See vignette("GettingStarted") for more details on the calculation of taxa densities.
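As a worked illustration of the density calculation, the sketch below applies the two formulas to made-up numbers. It assumes the USGS field split ratio and lab subsampling ratio combine by multiplication into PropID, which this page does not state explicitly. All variable names and values are hypothetical.

```r
n             <- 150   # specimens identified at the bench (made-up value)
field_split   <- 0.5   # proportion of the sample brought into the lab
lab_subsample <- 0.25  # proportion of grids identified at the bench
area_m2       <- 1.25  # area sampled in square meters (made-up value)

# Assumption: USGS-style PropID is the product of both ratios;
# EPA-style PropID is the lab subsampling ratio alone.
prop_id_usgs <- field_split * lab_subsample
prop_id_epa  <- lab_subsample

taxa_density <- function(n, prop_id, area_m2) {
  abundance <- n * (1 / prop_id)  # estimated individuals in the full sample
  abundance / area_m2             # individuals per square meter
}

density_usgs <- taxa_density(n, prop_id_usgs, area_m2)  # 1200 / 1.25 = 960
density_epa  <- taxa_density(n, prop_id_epa, area_m2)   # 600 / 1.25 = 480
```

With the same bench count, a smaller identified proportion scales the estimated abundance, and therefore the density, upward.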

Author

Michael Mahon, Devin Jones, Samantha Rumschlag, Terry Brown

Examples

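if (FALSE) {
# Genus-level occurrence matrix using the defaults
# (both agencies, taxonFix = "lump", rarefied to 300 individuals).
Inverts <- getInvertData(taxonLevel = "Genus")

# Same request with an explicit seed (seed = 10) for the rarefaction step.
RarefyInverts <- getInvertData(taxonLevel = "Genus",
                               rarefy = TRUE,
                               seed = 10)
}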