
Access and harmonize macroinvertebrate data
getInvertData.Rd
This function generates an occurrence or abundance community matrix for benthic macroinvertebrates sampled in rivers and streams from the US EPA National Rivers and Streams Assessment and USGS BioData.
Usage
getInvertData(
dataType = "occur",
taxonLevel = "Genus",
taxonFix = "lump",
agency = c("USGS", "EPA"),
lifestage = FALSE,
rarefy = TRUE,
rarefyCount = 300,
sharedTaxa = FALSE,
seed = 0,
boatableStreams = FALSE
)
Arguments
- dataType
Output data type for the community matrix, either "density" (density) or "occur" (occurrence).
- taxonLevel
Level of taxonomic resolution for the community matrix. Input must be one of: "Phylum", "Class", "Order", "Family", "Subfamily", "Genus", or "Mixed".
- taxonFix
Option to account for changes in taxonomy across time. Must be one of: "none", "lump", or "remove". See Details below for more information.
- agency
The agency name or names (e.g., "USGS" and "EPA") that are the source of data for the output community matrix. See Details below for more information.
- lifestage
logical. For USGS data only, should the output dataset include lifestage information for each taxon? TRUE or FALSE. Default is FALSE.
- rarefy
logical. Should samples be standardized by the number of individuals identified within each sample? TRUE or FALSE. See Details below for more information.
- rarefyCount
integer. If rarefy = TRUE, the individual count to be used as the cutoff for rarefaction (standardizing samples by the number of individuals identified). Default is 300.
- sharedTaxa
logical. Should genera be limited to those that appear in both the EPA and USGS datasets? TRUE or FALSE. Must be set to FALSE when only one agency is specified. Default is FALSE.
- seed
numeric. Seed used when rarefy = TRUE so that results are consistent across runs of the function. The value is passed to set.seed internally.
- boatableStreams
logical. Should EPA boatable streams be included in the output dataset? TRUE or FALSE. Note: all USGS streams are wadeable. Including boatable streams is not advisable when building a dataset from both EPA and USGS data, because boatable EPA data and wadeable USGS data are not considered comparable.
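As a rough illustration of how the arguments above fit together, the sketch below builds an argument list for a genus-level occurrence matrix from both agencies and only calls getInvertData if that function is available in the session. It assumes the package that provides getInvertData has already been installed and attached; the names invert_args and inverts are illustrative only, and the structure of the returned object is not assumed here.

```r
# Arguments for a genus-level occurrence matrix from both agencies.
# All argument names and values are taken from the Usage section above.
invert_args <- list(
  dataType = "occur",
  taxonLevel = "Genus",
  taxonFix = "lump",
  agency = c("USGS", "EPA"),
  lifestage = FALSE,
  rarefy = TRUE,
  rarefyCount = 300,
  sharedTaxa = FALSE,
  seed = 1,
  boatableStreams = FALSE
)

# Guard: only run the call when getInvertData is available in this session
# (assumption: the providing package is attached by the user beforehand).
if (exists("getInvertData", mode = "function")) {
  inverts <- do.call(getInvertData, invert_args)
  # Inspect the result without assuming its exact structure.
  str(inverts, list.len = 5)
}
```

Keeping boatableStreams = FALSE here follows the advice above for datasets that combine EPA and USGS data.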
Details
agency refers to the agency that collected the invertebrate samples. If you
want to use data from both agencies, set agency = c("USGS", "EPA"), which is
the default. For USGS data, sampling data include all USGS BioData with
SampleMethodCode of "BERW", "IRTH", "SWAMP", "EMAP", "CDPHE RR", and "PNAMP".
Note that by default, only moving waters classified as "wadeable" are included,
but setting boatableStreams = TRUE will include observations from non-wadeable
streams. Some information included in the EPA dataset is not included in the
USGS dataset, specifically the observed wetted width of the stream/river.
taxonLevel refers to the taxonomic resolution (Genus, Class, Family, etc.) for
the sample by taxa matrix. The input values for this parameter are case
sensitive and must start with a capital letter. All observations taxonomically
coarser than the taxonLevel provided are dropped from the output community
matrix. For instance, if taxonLevel = "Genus", then observations identified at
Subfamily, Family, Order, Class, or Phylum levels are dropped. When
taxonLevel = "Subfamily", for taxa without subfamilies, the Family-level
designation is returned. "Genus" is the finest level of taxonomic resolution
provided for macroinvertebrates. The function also provides the option of
returning the lowest level of taxonomic identification for all specimens
(taxonLevel = "Mixed").
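The filtering described above can be pictured with a small base R sketch. It is not the function's internal code: the data frame obs, its column layout, and the use of NA for "not identified at this level" are all assumptions made for illustration.

```r
# Toy observations (hypothetical layout): NA means the specimen was not
# identified at that taxonomic level.
obs <- data.frame(
  Family = c("Baetidae", "Chironomidae", "Hydropsychidae"),
  Subfamily = c(NA, "Orthocladiinae", NA),
  Genus = c("Baetis", NA, NA),
  stringsAsFactors = FALSE
)

# Analogue of taxonLevel = "Genus": observations identified only at
# coarser levels are dropped.
genus_level <- obs[!is.na(obs$Genus), ]

# Analogue of taxonLevel = "Mixed": keep every observation and report the
# lowest level at which it was identified.
mixed_level <- ifelse(
  !is.na(obs$Genus), obs$Genus,
  ifelse(!is.na(obs$Subfamily), obs$Subfamily, obs$Family)
)
```

In this toy data only the Baetis row survives the genus-level filter, while the mixed-level vector keeps all three observations at their finest available identification.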
taxonFix provides options to account for changes in taxonomy across time,
especially in instances in which species have been reorganized into new genera.
taxonFix operates on the genus level. taxonFix = "none" makes no adjustment.
taxonFix = "lump" prioritizes retaining observations by giving a unified
"slash" genus name to all species and genera that have been linked through
changes in taxonomy through time (e.g., genera1/genera2/genera3). Note: of 82
problematic genera that exist throughout both datasets, taxonFix = "lump"
results in 11 "lumped" genera. All but two complexes of genera were composed of
two individual genera. A single complex of genera within the order
Ephemeroptera included 70 individual genera, and a second complex of
Ephemeroptera genera included six individual genera. Because of the complexity
of Ephemeroptera taxonomy, careful consideration should be given to inferences
that can be made when evaluating Ephemeroptera trends. The authors, for
instance, do not advise that users evaluate temporal changes in abundance or
richness of the linked groups of Ephemeroptera genera because taxonomic
reorganizations likely obscure temporal patterns.
Alternatively, genera linked by taxonomic reorganization can be removed with
taxonFix = "remove". This option prioritizes accurate identification by
dropping observations that cannot be confidently identified to a single genus,
as in the complexes of genera previously described. Without a species-level
identification, there is no way to assure correct membership in an updated
genus. Organisms with a species-level identification are cross-walked to an
updated genus.
NOTE on "slash" genera: When taxonFix = "lump", "slash" genera are rolled into
the larger linked genera, as above. taxonFix = "remove" prioritizes accurate
identifications by dropping all "slash" genera, including those organisms
identified as a "slash" genus in the lab; this option will result in many fewer
genera in the final dataset. Finally, taxonFix = "none" includes "slash"
genera, but it does not connect these genera to larger linked genera.
Because taxonFix operates on the genus level, set taxonFix = "none" when
taxonLevel is set to "Family" or a coarser taxonomic level.
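To make the difference between the "lump" and "remove" options concrete, here is a minimal base R sketch on placeholder genera. It is not the package's implementation: the lookup table complexes, the genus names, and the counts are hypothetical, and the species-level cross-walk described above is deliberately left out.

```r
# Toy genus-level counts; "GenusA/GenusB" is a lab-assigned "slash" genus.
obs <- data.frame(
  Genus = c("GenusA", "GenusB", "GenusC", "GenusA/GenusB"),
  Count = c(10, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: genera linked by taxonomic reorganization map to a
# single unified "slash" name.
complexes <- c(
  "GenusA" = "GenusA/GenusB",
  "GenusB" = "GenusA/GenusB",
  "GenusA/GenusB" = "GenusA/GenusB"
)
in_complex <- obs$Genus %in% names(complexes)

# Analogue of taxonFix = "lump": rename linked genera to the unified name,
# then sum their counts so no observations are lost.
lumped <- obs
lumped$Genus[in_complex] <- unname(complexes[lumped$Genus[in_complex]])
lumped <- aggregate(Count ~ Genus, data = lumped, FUN = sum)

# Analogue of taxonFix = "remove": drop every observation that belongs to a
# linked complex, keeping only unambiguous genera.
removed <- obs[!in_complex, ]
```

The lumped table keeps all 25 individuals in two rows, whereas the removed table keeps only the seven individuals of the unambiguous genus, which mirrors the trade-off between retaining observations and retaining confident identifications.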
Care should be taken to harmonize taxonomy, either with the approaches provided
or some alternative, when long time scale datasets on the entire community of
macroinvertebrates are generated, because changes in taxonomy can make it
artificially appear as though some genera are either appearing or disappearing
in time. See vignette("GettingStarted") for more information.
If rarefy = TRUE, samples with rarefyCount or more individuals identified (raw
count) are retained. Thus, a percentage of samples will be removed, as they
have fewer than rarefyCount individuals sampled. The default rarefaction
threshold is 300 organisms because 1) with every 50 individuals identified,
~1 genus is added to the sample and 2) 91.3% of samples have at least 300
individuals identified. Thus, lowering the threshold to 200 individuals
removed ~2 genera per sample, but only an additional 3.2% of samples are
included (94.5% of all samples), whereas raising the threshold to 400
individuals added ~2 genera per sample, but reduced samples to 70.6% of all
samples. Use seed = ... to get consistent output of community data. See
vignette("GettingStarted") for more information regarding rarefaction. NOTE:
rarefy = TRUE can be used when a user wants occurrence data (presence/absence)
OR proportional data (each taxon represents a certain proportion of a sample).
Use rarefy = FALSE when densities are the measure of interest.
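The threshold-and-subsample idea behind rarefaction can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's internal code: the helper rarefy_sample, its arguments, and the example counts are hypothetical, and drawing individuals without replacement is one common way to rarefy rather than a documented detail of getInvertData.

```r
# Hypothetical helper: rarefy one sample of named raw counts to a fixed
# number of individuals, or return NULL if the sample is too small.
rarefy_sample <- function(counts, rarefy_count = 300, seed = 0) {
  # Samples with fewer than rarefy_count individuals are dropped.
  if (sum(counts) < rarefy_count) {
    return(NULL)
  }
  # Fixing the seed makes the random draw reproducible.
  set.seed(seed)
  # Expand counts to one entry per individual, then draw without replacement.
  individuals <- rep(names(counts), times = counts)
  drawn <- sample(individuals, size = rarefy_count, replace = FALSE)
  tab <- table(factor(drawn, levels = names(counts)))
  setNames(as.integer(tab), names(tab))
}

# A sample with 400 individuals is kept and reduced to 300.
large_sample <- c(Baetis = 250, Hydropsyche = 120, Chironomus = 30)
rarefied <- rarefy_sample(large_sample, rarefy_count = 300, seed = 1)

# A sample with only 100 individuals falls below the threshold.
small_sample <- c(Baetis = 100)
dropped <- rarefy_sample(small_sample, rarefy_count = 300, seed = 1)
```

Calling the helper twice with the same seed returns the same rarefied counts, which is the behaviour the seed argument is meant to provide.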
When dataType = "density", the function calculates taxa densities from samples
using lab subsampling ratios and area sampled:
$$Taxa~abundance = n \times \frac{1}{PropID}$$
$$Taxa~density = \frac{Taxa~abundance}{Area~sampled~(m^2)}$$
where n is the number of specimens identified and PropID is the proportion of
the sample that was identified at the lab bench. For the USGS dataset, this
incorporates both the "field split ratio" (proportion of the sample that was
brought into the lab for specimen identification) and the "lab subsampling
ratio" (proportion of grids used to identify invertebrates at the lab bench).
For the EPA datasets, this is just the "lab subsampling ratio", the proportion
of grids used to identify invertebrates at the lab bench. See
vignette("GettingStarted") for more details on the calculation of taxa
densities.
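A short worked example of the two formulas above, using made-up numbers. All values and variable names are hypothetical, and combining the field split ratio and the lab subsampling ratio by multiplication is an assumption about how a USGS-style PropID would be formed; the vignette is the authority on the actual calculation.

```r
# Made-up inputs for a single taxon in a single USGS-style sample.
n_identified <- 45            # specimens of the taxon identified at the bench
field_split_ratio <- 0.5      # share of the field sample brought to the lab
lab_subsampling_ratio <- 0.25 # share of grids sorted at the lab bench
area_sampled_m2 <- 1.25       # area sampled in square metres

# Assumption: PropID is the product of the two ratios.
prop_id <- field_split_ratio * lab_subsampling_ratio

# Taxa abundance = n * (1 / PropID)
taxa_abundance <- n_identified * (1 / prop_id)

# Taxa density = Taxa abundance / Area sampled (m^2)
taxa_density <- taxa_abundance / area_sampled_m2
```

With these inputs PropID is 0.125, so the 45 identified specimens scale up to an abundance of 360 individuals and a density of 288 individuals per square metre. For an EPA-style sample, the same steps would apply with PropID set to the lab subsampling ratio alone.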