TADA: WaterSciCon Workshop Demo
TADA Team
2024-11-15
Source:vignettes/articles/TADAWaterSciConWorkshopDemo.Rmd
TADAWaterSciConWorkshopDemo.Rmd
Overview
We will walk through the specific use case of query, downloading, preparing, and QA/QCing data from the San Juan and Animas Rivers near Farmington, New Mexico with a focus on total dissolved solids and conductivity.
This vignette is designed to be concise and provide an example of how you might use TADA functions for basic data discovery and preparation. It leverages functions from TADA Modules 1 and 2. For more details on each function, see the TADA Function Reference or the more comprehensive module vignettes:
Load Required Packages
The fist step is to load the TADA package and other packages used in our workflow. (If you are a first-time TADA user, see TADA Module 1 Beginner Training for TADA package installation instructions.)
We’ll also record the start time for our analysis, beginning after loading packages.
How’s My Waterway
To help define the area of interest, we can start by taking a look at How’s My Waterway by searching for Farmington, New Mexico: HMW_Farmington, NM. We can identify the Assessment Unit Identifiers for the segments of the San Juan and Animas Rivers closest to their confluence. These are:
1) San Juan River (Navajo bnd at Hogback to Animas River) - NM-2401_10
2) Animas River (San Juan River to Estes Arroyo) - NM-2403.A_00
We can also identify the HUC-12s for each segment. This information will help us build our Water Quality Portal query to find results for the segments of interest plus other nearby water bodies. The HUC-12s are 140801050505 (San Juan River) and 140801041006 (Animas River).
Data Retrieval - Water Chemistry
The water chemistry data we will use are downloaded from the Water Quality Portal (https://www.waterqualitydata.us/). The WQP integrates publicly available water quality data from the USGS National Water Information System (NWIS) and the EPA Water Quality Exchange (WQX) Data Warehouse. The EPA water quality data originate from the Water Quality Exchange, the EPA’s repository of water quality monitoring data collected by water resource management groups across the country. Organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit data to the WQX.
One of the fields that can be used to query the WQP is HUC. As we
know the HUC-12s of interest, we will use them to retrieve WQP data
using TADA_DataRetrieval
. As we are most interested in
total dissolved solids and conductivity and these are in the “Physical”
characteristic type group in the WQP, we will also include
characteristic group type in our query. We will not set start and end
dates as we are interested in all available data for these locations and
characteristics. .
# Import data from WQP
data <- TADA_DataRetrieval(
statecode = "NM",
huc = c("140801050505", "140801041006"),
characteristicType = "Physical",
applyautoclean = TRUE
)
## [1] "Downloading WQP query results. This may take some time depending upon the query size."
## $statecode
## [1] "US:35"
##
## $huc
## [1] "140801050505" "140801041006"
##
## $characteristicType
## [1] "Physical"
## [1] "Data successfully downloaded. Running TADA_AutoClean function."
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "No deprecated characteristic names found in dataset."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"
AutoClean
As part of TADA_DataRetrieval
,
TADA_AutoClean
is performed automatically unless specified
otherwise by the user (by setting applyautoclean = FALSE). When
TADA_AutoClean
is run, the following functions are
performed on the data retrieved from the WQP:
TADA_ConvertSpecialChars
- converts result value columns to numeric and flags non-numeric values that could not be converted.TADA_ConvertResultUnits
- unifies result units for easier quality control and reviewTADA_ConvertDepthUnits
- converts depth units to a consistent unit (meters).TADA_IDCensoredData
- categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.Other helpful actions - converts important text columns to all upper-case letters, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” This allows users to easily review any changes TADA functions have made.
Review and Filter Results By Comparable Data Identifier
Now we can review the results and decide which ones we would like to retain for further QC and analysis. First let’s take a look at all unique values in the TADA.ComparableDataIdentifier column and see how how many results are associated with each. TADA.ComparableDataIdentifier concatenates TADA.CharacteristicName, TADA.ResultSampleFractionText, TADA.MethodSpeciationName, and TADA.ResultMeasure.MeasureUnitCode.
# use TADA_FieldValuesTable to create a table of the number of results per TADA.ComparableDataIdentifier
compid_table <- TADA_FieldValuesTable(data, field = "TADA.ComparableDataIdentifier")
DT::datatable(compid_table, fillContainer = TRUE)
As we scroll through the table, we can see that there are some
repeated characteristic names. It may be possible that some of these are
synonyms which can be harmonized using
TADA_HarmonizeSynonyms
so their results can be directly
compared. However, for purposes of this demo we will not explore
harmonization and will instead focus on the two
TADA.ComparableDataIdentifiers with the greatest number of results,
“TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L” and “SPECIFIC
CONDUCTANCE_TOTAL_NA_US/CM @25C”. For
information on harmonization, see the EPATADA
Function Reference.
Now, we can filter the data set to retain only the TADA.ComparableDataIdentifiers of interest.
Water Chemistry Data Preparation and QC
Before performing any analysis, there are some additional data preparation and QC steps that can be performed using TADA functions.
First, we can check the data for additional conditions which might
cause us to exclude certain results from further analysis. For example,
we can use TADA_FlagMethod
to check for invalid analytical
method-characteristic combinations; TADA_FlagSpeciation
to
check for invalid characteristics-method speciation combinations,
TADA_FlagResultUnit
to check the the validity of each
characteristic-media-result unit combination, and
TADA_FlagFraction
to check the validity of
characteristic-fraction combinations. In this simple example, we will
remove all invalid combinations identified by each of these functions.
When setting up your own workflow, you will want to review each function
and its output more carefully before moving on.
data <- data %>%
TADA_FlagMethod(clean = TRUE) %>%
TADA_FlagSpeciation(clean = "both") %>%
TADA_FlagResultUnit(clean = "both") %>%
TADA_FlagFraction(clean = TRUE)
## [1] "No Suspect method/characteristic combinations in your dataframe. Returning the input dataframe with TADA.AnalyticalMethod.Flag column for tracking."
## [1] "All characteristic/fraction combinations are valid in your dataframe. Returning input dataframe with TADA.SampleFraction.Flag column for tracking."
Another set of TADA flagging functions,
TADA_FlagAboveThreshold
and
TADA_FlagBelowThreshold
, can be used to check results
against national lower and upper thresholds. As in the previous
examples, we will set these functions to remove any results that fall
outside the national thresholds.
data <- data %>%
TADA_FlagAboveThreshold(clean = TRUE) %>%
TADA_FlagBelowThreshold(clean = TRUE)
## [1] "No data below the WQX Lower Threshold were found in your dataframe. Returning the input dataframe with TADA.ResultValueBelowLowerThreshold.Flag column for tracking."
We can also review the data set to see if it contains potential duplicate results from within a single organization or potential duplicates from within multiple organizations (such as when two or more organizations monitor the same location and may submit duplicate results).
We will select to keep only unique samples from
TADA_FindPotentialDuplicatesSingleOrg
by filtering for
TADA.SingleOrgDup.Flag equals “Unique”. Then we will search for
potential duplicates from multiple orgs with
TADA_FindPotentialDuplicatesMultipleOrgs
and filter only to
retain only one value per potential duplicate by filtering for
TADA.ResultSelectedMultipleOrgs equals “Y”.
If you would like to prioritize results from one organization over
another, this can be done using the org_hierarchy argument in
TADA_FindPotentialDuplicatesMultipleOrgs
. Information on
how to this can be found in the function reference for TADA_FindPotentialDuplicatesMultipleOrgs.
In this example, we will use the default setting “none” where a single
representative is selected randomly from each duplicate group.
data <- data %>%
TADA_FindPotentialDuplicatesSingleOrg() %>%
dplyr::filter(TADA.SingleOrgDup.Flag == "Unique") %>%
TADA_FindPotentialDuplicatesMultipleOrgs(
dist_buffer = 100,
org_hierarchy = "none"
) %>%
dplyr::filter(TADA.ResultSelectedMultipleOrgs == "Y")
## [1] "39 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
## [1] "No nearby sites detected using input buffer distance."
## [1] "No duplicate results detected. Returning input dataframe with duplicate flagging columns set to 'N'."
Next we can remove QC and qualified samples (with qualifiers
identified as suspect) using TADA_FindQCActivities
and
TADA_FlagMeasureQualifierCode
, respectively. See the TADA
Module 1 vignettes for more information on how these functions identify
and flag results for removal as you may wish to make different decisions
than the TADA defaults regarding which results to retain.
data <- data %>%
TADA_FindQCActivities(clean = TRUE) %>%
TADA_FlagMeasureQualifierCode(clean = TRUE)
## [1] "ActivityTypeCode column in dataset contains value(s) Quality Control Sample-Spike Solution which is/are not in the ActivityType WQX domain table. These data records are USGS only values and placed under the TADA.ActivityType.Flag: 'Not Reviewed'. Please review these carefully to detemine data usability."
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
Censored Data
We can use a TADA function to identify any censored data in the data set.
data <- TADA_IDCensoredData(data)
## [1] "TADA_IDCensoredData: No censored data detected in your data frame. Returning input dataframe with new column TADA.CensoredData.Flag set to Uncensored"
Because there are no censored data in the data set, we do not need to
take any further action. If censored data do exist,
TADA_SimpleCensoredMethods
could be used. See the TADA
Module 1 vignettes or for more information.
Data Retrieval - Geospatial
All geospatial data used in this example are downloaded from the Assessment Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) (https://www.epa.gov/waterdata/attains). ATTAINS is an online system for accessing information about conditions in the Nation’s surface waters.
This information reported to EPA by states is available in ATTAINS. The public information is made available via ATTAINS web services. New TADA functions leverage the ATTAINS geospatial services to make geospatial information including catchment and assessment unit geometry easily accessible to R users.
We can use the function, TADA_GetATTAINS
to obtain
geospatial data from ATTAINS relevant to the Monitoring
Locations included in the WQP data set. See TADA
Module 2 for a much more detailed look at the logic behind this and
the other TADA geospatial functions.
# Import data from ATTAINS geospatial services
ATTAINS_data <- TADA_GetATTAINS(data)
## [1] "Transforming your data into a spatial object."
## [1] "Depending on your data's observation count and its spatial range, the ATTAINS pull may take a while."
View Geospatial Features
The TADA_ViewATTAINS
function allows us to see where the
monitoring locations from the WQP data set are relative
to ATTAINS-indexed catchments and assessment units.
This can be helpful when deciding which Monitoring Locations should be
retained for additional analysis. It can also allow us to sub
# View catchments and assessment units on map
ATTAINS_map <- TADA_ViewATTAINS(ATTAINS_data)
ATTAINS_map
Monitoring Location Filter and Review
Now that we’ve associated geospatial data from ATTAINS with the WQP data, we can generate pie charts for each TADA.ComparableDataIdentifier to give us some information about how many results are available per each monitoring location group.
# Select TADA data with ATTAINS data
data <- ATTAINS_data$TADA_with_ATTAINS %>%
# Remove geometry to reduce size of data set
sf::st_drop_geometry() %>%
# Assing "Other" as name for unnamed assessment units
dplyr::mutate(
ATTAINS.assessmentunitname =
ifelse(is.na(ATTAINS.assessmentunitname),
"Other", ATTAINS.assessmentunitname
)
)
# Create pie chart of results for each AUID group for total dissolved solids
TADA_FieldValuesPie(data,
field = "ATTAINS.assessmentunitname",
characteristicName = "TOTAL DISSOLVED SOLIDS"
)
# Create pie chart of results for each AUID group for specific conductance
TADA_FieldValuesPie(data,
field = "ATTAINS.assessmentunitname",
characteristicName = "SPECIFIC CONDUCTANCE"
)
We can also create a table to give us a little more information about assessment unit each monitoring location belongs to, the number of results per each parameter and the oldest and most recent sampling date.
# Create table of monitoring location identifiers
MonitoringLocations <- data %>%
dplyr::select(
MonitoringLocationName, MonitoringLocationIdentifier, OrganizationFormalName,
ATTAINS.assessmentunitname, TADA.CharacteristicName, ActivityStartDate
) %>%
dplyr::group_by(MonitoringLocationIdentifier) %>%
dplyr::mutate(
n_TDS = length(TADA.CharacteristicName[TADA.CharacteristicName == "TOTAL DISSOLVED SOLIDS"]),
n_SpecificConductance = length(TADA.CharacteristicName[TADA.CharacteristicName == "SPECIFIC CONDUCTANCE"]),
StartDate = min(ActivityStartDate),
EndDate = max(ActivityStartDate)
) %>%
dplyr::select(
MonitoringLocationIdentifier,
ATTAINS.assessmentunitname,
n_TDS,
n_SpecificConductance,
StartDate,
EndDate
) %>%
dplyr::distinct()
DT::datatable(MonitoringLocations, fillContainer = TRUE)
Now, let’s create subsets of the data by assessment unit name. These will be useful to create some basic data visualizations.
data_animas <- data %>%
dplyr::filter(ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)")
data_sanjuan <- data %>%
dplyr::filter(ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)")
data_other <- data %>%
dplyr::filter(ATTAINS.assessmentunitname == "Other")
Data Visualization
First, let’s create a two characteristic scatterplot of total
dissolved solids and specific conductance from the entire data set to
begin visualizing the data, using
TADA_TwoCharacteristicScatterplot
.
# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(data, id_cols = "TADA.ComparableDataIdentifier", groups = c("SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C", "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L"))
While there are some older records (including a specific conductance result from 1900), we may decide we are interested only in more recent data. We could filter the data set to exclude any results collected prior to 2015 and recreate the scatterplot.
# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(
data %>%
dplyr::filter
(ActivityStartDate > "2014-12-31"),
id_cols = "TADA.ComparableDataIdentifier",
groups = c(
"SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C",
"TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L"
)
)
We could also choose to explore one characteristic from one of the assessment units more closely. For example, we could create a histograms for both characteristics for the Animas River only, excluding samples prior to 2014.
TADA_Histogram(
data %>%
dplyr::filter(
ActivityStartDate > "2014-12-31",
TADA.CharacteristicName ==
"SPECIFIC CONDUCTANCE",
ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)"
),
id_cols = "TADA.ComparableDataIdentifier"
)
Or we could create a boxplot for a particular characteristic and assessment unit, for example, total dissolved solids for the San Juan River.
TADA_Boxplot(
data %>%
dplyr::filter(
ActivityStartDate > "2014-12-31",
TADA.CharacteristicName ==
"SPECIFIC CONDUCTANCE",
ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)"
),
id_cols = "TADA.ComparableDataIdentifier"
)
We might also want to directly compare the same characteristic
overtime between the two rivers. We can do this by revisiting
TADA.TwoCharacteristicScatterplot.
# create two characteristic scatterplot using TADA_TWoCharacteristicScatterplot
twochar_scatter <- TADA_TwoCharacteristicScatterplot(
data %>%
dplyr::filter(
ActivityStartDate > "2014-12-31",
TADA.ComparableDataIdentifier ==
"SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C"
),
id_cols = "ATTAINS.assessmentunitname",
groups = c("San Juan River (Navajo bnd at Hogback to Animas River)", "Animas River (San Juan River to Estes Arroyo)")
) %>%
# remove default plot features that are not applicable for a location comparison
plotly::layout(
yaxis2 = list(overlaying = "y", side = "right", title = "", visible = FALSE),
title = TADA_InsertBreaks("SPECIFIC CONDUCTANCE for the San Juan and Animas Rivers Over Time")
)
## [1] "Note: TADA.ComparableDataIdentifier not found in id_cols argument and is highly recommended."
# modify plot titles to reflect group/location names
twochar_scatter <- plotly::plotly_build(twochar_scatter)
twochar_scatter$x$data[[2]]$name <- "San Juan River"
twochar_scatter$x$data[[3]]$name <- "Animas River"
# rebuild graph with corrected legend labels
plotly::plotly_build(twochar_scatter)
We can also use TADA_Stats
to generate a summary table
for the entire data set or subsets of it for each river.
# create table with all data
data_stats <- TADA_Stats(data)
DT::datatable(data_stats, fillContainer = TRUE)
# create table with animas data
animas_stats <- TADA_Stats(data %>% dplyr::filter(ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)"))
DT::datatable(animas_stats, fillContainer = TRUE)
# create table with san juan data
sanjuan_stats <- TADA_Stats(data %>% dplyr::filter(ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)"))
DT::datatable(sanjuan_stats, fillContainer = TRUE)
Reproducible and Documented
This workflow is reproducible and the decisions at each step are well documented. This means that it is easy to go back and review every step, understand the decisions that were made, make changes as necessary, and run it again.
For example, we could modify the code to repeat the same analysis at
a different location. Or we could run
TADA_ConvertResultUnits
an additional time outside to
convert units for one or more characteristics.
We could change or add additional characteristics to our analysis. We could even write additional custom functions or even use functions from other packages as needed for more complex standards, evaluate trends, or answer other questions using a TADA data frame.
The code chunk below displays the elapsed time it took to run the code chunks and create this document, providing an example of how incorporating TADA in your workflow can increase efficiency.
end.time <- Sys.time()
end.time - start.time
## Time difference of 1.104267 mins