TADA: Assessment Unit Use Case Demo
TADA Team
2025-03-18
Source:vignettes/TADAAssessmentUnitUseCase.Rmd
TADAAssessmentUnitUseCase.Rmd
Overview
We will walk through the specific use case of query, downloading, preparing, QA/QCing data from a single Assessment Unit with multiple Monitoring Locations sampled by several organizations. Three characteristics, pH, water temperature, and total nitrogen are used in this example. Two of the characteristics, pH and water temperature, will be compared to a state water quality standard from Illinois. The third characteristic, total nitrogen, does not have an Illinois water quality standard, but is used to demonstrate some helpful functions for data exploration and visualization.
This vignette is designed to be concise and provide an example of how you might use TADA functions for basic data discovery and analysis with regards to a single Assessment Unit and a small subset of characteristics. It leverages functions from TADA Modules 1 and 2. For more details on each function, see the TADA Function Reference or the more comprehensive module vignettes:
Install and load packages
First, install and load the remotes package specifying the repo. This is needed before installing TADA because it is only available on GitHub (not CRAN).
install.packages("remotes",
repos = "http://cran.us.r-project.org"
)
library(remotes)
Next, install and load EPATADA using the remotes package.
remotes::install_github("USEPA/EPATADA",
ref = "develop",
dependencies = TRUE
)
Load Required Packages
The fist step is to load the TADA package and other packages used in our workflow. (If you are a first-time TADA user, see TADA Module 1 Beginner Training for TADA package installation instructions.)
We’ll also record the start time for our analysis, beginning after loading packages.
list.of.packages <- c(
"dplyr", "yaml", "lwgeom", "sf", "lubridate",
"knitr", "DT"
)
new.packages <- list.of.packages[!(list.of.packages %in%
installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
# Load required packages
library(dplyr)
library(yaml)
library(lwgeom)
library(sf)
library(lubridate)
library(knitr)
library(DT)
# Record start time
start.time <- Sys.time()
Data Retrieval - Water Chemistry
The water chemistry data we will use are downloaded from the Water Quality Portal (https://www.waterqualitydata.us/). The WQP integrates publicly available water quality data from the USGS National Water Information System (NWIS) and the EPA Water Quality Exchange (WQX) Data Warehouse. The EPA water quality data originate from the Water Quality Exchange, the EPA’s repository of water quality monitoring data collected by water resource management groups across the country. Organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit data to the WQX.
We are interested in Assessment Unit “IL_I-84”. This is a Mississippi River Assessment Unit. Because we don’t have a comprehensive list of the Monitoring Locations in this Assessment Unit, we will need to create a broad query to retrieve all results that might be associated the IL_I-84. One way to do this is to include all HUC10s that are relevant to the Assessment Unit.
We can use TADA_DataRetrieval
to retrieve WQP data. In
this example, we have specified lists of HUCs and Characteristic Types
to reduce the size of the query since we have a general idea of the
location of our assessment unit of interest and know which
characteristics we’d like to explore. One quick way to find this
information is to view IL_I-84’s Waterbody Report on How’s My Waterway:
WaterbodyReport/IL_EPA/IL_I-84/2022.
We’ll also need to set start and end dates. For this example we’ll use a ten year range between 2010 and 2020. We will also set the Characteristic Type to “Nutrient” and “Physical”, because these types contain nitrogen, pH and water temperature Characteristics. It is possible to query by a specific Characteristic Name, but different organizations may input the same name in a variety of ways. Therefore, we recommend keeping your initial queries as broad as possible so all relevant results are downloaded and useful data from synonymous Characteristics are not missed. There are TADA functions that can be used to harmonize these synonyms.
# Import data from WQP
data <- TADA_DataRetrieval(
statecode = "IL",
startDate = "2010-01-01",
endDate = "2020-12-31",
huc = c("0714010505", "0714010504", "0714010508", "0714010501", "0714010503"),
applyautoclean = TRUE,
ask = FALSE
)
AutoClean
As part of TADA_DataRetrieval
,
TADA_AutoClean
is performed automatically unless specified
otherwise by the user (by setting applyautoclean = FALSE). When
TADA_AutoClean
is run, the following functions are
performed on the data retrieved from the WQP:
TADA_ConvertSpecialChars
- converts result value columns to numeric and flags non-numeric values that could not be converted.TADA_ConvertResultUnits
- unifies result units for easier quality control and reviewTADA_ConvertDepthUnits
- converts depth units to a consistent unit (meters).TADA_IDCensoredData
- categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.Other helpful actions - converts important text columns to all upper-case letters, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” This allows users to easily review any changes TADA functions have made.
Data Retrieval - Geospatial
All geospatial data used in this example are downloaded from the Assessment Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) (https://www.epa.gov/waterdata/attains). ATTAINS is an online system for accessing information about conditions in the Nation’s surface waters.
This information reported to EPA by states is available in ATTAINS. The public information is made available via ATTAINS web services. New TADA functions leverage the ATTAINS geospatial services to make geospatial information including catchment and assessment unit geometry easily accessible to R users.
We can use the function, TADA_GetATTAINS
to obtain
geospatial data from ATTAINS relevant to the Monitoring
Locations included in the WQP data set. See TADA
Module 2 for a much more detailed look at the logic behind this and
the other TADA geospatial functions.
# Import data from ATTAINS geospatial services
ATTAINS_data <- TADA_GetATTAINS(data)
View Geospatial Features
The TADA_ViewATTAINS
function allows us to see where the
monitoring locations from the WQP data set are relative
to ATTAINS-indexed catchments and assessment units.
This can be helpful when deciding which Monitoring Locations should be
retained for additional analysis. For this demo, we are focusing on a
single Assessment Unit, IL-84, with multiple monitoring locations
(MonitoringLocationIdentifier).
# View catchments and assessment units on map
ATTAINS_map <- TADA_ViewATTAINS(ATTAINS_data)
ATTAINS_map
Monitoring Location Filter and Review
Now that we’ve associated geospatial data from ATTAINS with the WQP data, we can filter the data set to retain only results that were collected in the specified Assessment Unit. We can also generate a new table to give us some information about the individual monitoring locations within the assessment unit. After this filter step is complete, we can remove the “ATTAINS.” prefixed columns to reduce the size of the data set, as we won’t need them for the remaining steps in this example.
We can also create a table with some basic information about the Monitoring Locations in Assessment Unit IL-84 and a pie chart to display the relative number of results contributed by each organization.
# Filter data for specified assessment unit
AUID_data <- ATTAINS_data$TADA_with_ATTAINS %>%
dplyr::filter(ATTAINS.assessmentunitidentifier == "IL_I-84")
Analysis_data <- ATTAINS_data$TADA_with_ATTAINS %>%
dplyr::filter(ATTAINS.assessmentunitidentifier == "IL_I-84") %>%
dplyr::select(-contains("ATTAINS.")) %>%
sf::st_drop_geometry() %>%
TADA_RetainRequired()
# Create table of monitoring location identifiers
MonitoringLocations <- Analysis_data %>%
dplyr::select(MonitoringLocationName, MonitoringLocationIdentifier, OrganizationFormalName) %>%
dplyr::distinct()
DT::datatable(MonitoringLocations, fillContainer = TRUE)
# Create pie of results by organization
Orgs_pie <- TADA_FieldValuesPie(Analysis_data, field = "OrganizationFormalName")
Orgs_pie
This allows us to easily review how many monitoring locations, organizations, and results are in the dataframe for the selected time period.
Water Chemistry Data Preparation and QC
Before we can begin comparisons between the water chemistry results and a water quality standard, there are some additional data preparation and QC steps that can be performed using TADA functions.
First, we can check the data for additional conditions which might
cause us to exclude certain results from further analysis. For example,
we can use TADA_FlagMethod
to check for invalid analytical
method-characteristic combinations; TADA_FlagSpeciation
to
check for invalid characteristics-method speciation combinations,
TADA_FlagResultUnit
to check the the validity of each
characteristic-media-result unit combination, and
TADA_FlagFraction
to check the validity of
characteristic-fraction combinations. In this simple example, we will
remove all invalid combinations identified by each of these functions.
When setting up your own workflow, you will want to review each function
and its output more carefully before moving on.
# Flag and remove results
Analysis_data <- Analysis_data %>%
TADA_FlagMethod(clean = TRUE) %>%
TADA_FlagSpeciation(clean = "both") %>%
TADA_FlagResultUnit(clean = "both") %>%
TADA_FlagFraction(clean = TRUE)
Another set of TADA flagging functions,
TADA_FlagAboveThreshold
and
TADA_FlagBelowThreshold
, can be used to check results
against national lower and upper thresholds. As in the previous
examples, we will set these functions to remove any results that fall
outside the national thresholds.
# Flag and remove results
Analysis_data <- Analysis_data %>%
TADA_FlagAboveThreshold(clean = TRUE) %>%
TADA_FlagBelowThreshold(clean = TRUE)
We can also review the data set to see if it contains potential duplicate results from within a single organization or potential duplicates from within multiple organizations (such as when two or more organizations monitor the same location and may submit duplicate results).
We will select to keep only unique samples from
TADA_FindPotentialDuplicatesSingleOrg
by filtering for
TADA.SingleOrgDup.Flag equals “Unique”. Then we will search for
potential duplicates from multiple orgs with
TADA_FindPotentialDuplicatesMultipleOrgs
and filter only to
retain only one value per potential duplicate by filtering for
TADA.ResultSelectedMultipleOrgs equals “Y”.
If you would like to prioritize results from one organization over
another, this can be done using the org_hierarchy argument in
TADA_FindPotentialDuplicatesMultipleOrgs
. Information on
how to this can be found in the function reference for TADA_FindPotentialDuplicatesMultipleOrgs.
In this example, we will use the default setting “none” where a single
representative is selected randomly from each duplicate group.
# Flag and remove results
Analysis_data <- Analysis_data %>%
TADA_FindPotentialDuplicatesSingleOrg() %>%
dplyr::filter(TADA.SingleOrgDup.Flag == "Unique") %>%
TADA_FindPotentialDuplicatesMultipleOrgs(
dist_buffer = 100,
org_hierarchy = "none"
) %>%
dplyr::filter(TADA.ResultSelectedMultipleOrgs == "Y")
Next we can remove QC and qualified samples (with qualifiers
identified as suspect) using TADA_FindQCActivities
and
TADA_FlagMeasureQualifierCode
, respectively. See the TADA
Module 1 vignettes for more information on how these functions identify
and flag results for removal as you may wish to make different decisions
than the TADA defaults regarding which results to retain.
# Flag and remove results
Analysis_data <- Analysis_data %>%
TADA_FindQCActivities(clean = TRUE) %>%
TADA_FlagMeasureQualifierCode(clean = TRUE)
Censored Data
Censored data are measurements for which the true value is not known,
but we can estimate the value based on lower or upper detection
conditions and limit types. TADA fills missing
TADA.ResultMeasureValue and
TADA.ResultMeasure.MeasureUnitCode values with values and units
from TADA.DetectionQuantitationLimitMeasure.MeasureValue and
TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode,
respectively, using the TADA_AutoClean
function.
The TADA package currently has functions that summarize censored data
incidence in the dataset and perform simple substitutions of censored
data values, including x times the detection limit and random selection
of a value between 0 and the detection limit. The user may specify the
methods used for non-detects and over-detects separately in the input to
the TADA_SimpleCensoredMethods
function. The next step we
take in this example is to perform simple conversions to the censored
data in the dataset: we keep over-detects as is (no conversion made) and
convert non-detect values to 0.5 times the detection limit (half the
detection limit).
See the TADA Module 1 vignettes for more detailed information on the
logic behind
TADA_SimpleCensoredMethods
.
After running TADA_SimpleCensoredMethods
, we will also
filter the data to remove any remaining NAs.
Analysis_data <- TADA_SimpleCensoredMethods(Analysis_data,
nd_method = "multiplier",
nd_multiplier = 0.5,
od_method = "as-is",
od_multiplier = "null"
) %>%
dplyr::filter(!is.na(TADA.ResultMeasureValue))
Harmonize Results
The TADA_GetSynonymRef
function generates a synonym
reference table that is specific to the input dataframe. Users can
review how their input data relates to standard TADA values for the
following elements: TADA.CharacteristicName,
TADA.ResultSampleFractionText, TADA.MethodSpeciationName, and
TADA.ResultMeasure.MeasureUnitCode and edit if desired.
Users can also edit the reference file to meet their needs if
desired. The download argument can be used to save the harmonization
file to your current working directory when
download = TRUE
, the default is
download = FALSE
. In this example, we will modify the
harmonization file to group results with the fraction “NA” and results
with the fraction “TOTAL” for pH and water temperature.
The TADA_HarmonizeSynonyms
function then compares the
input dataframe to the TADA Synonym Reference Table and makes
conversions where target characteristics/fractions/speciations/units are
provided. This function also appends a column called
TADA.Harmonized.Flag, indicating which results had metadata
changed/converted in this function. The purpose of this function is to
make similar data consistent and therefore easier to compare and
analyze.
UniqueHarmonizationRef <- TADA_GetSynonymRef(Analysis_data)
UniqueHarmonizationRef_edit <- UniqueHarmonizationRef %>%
dplyr::mutate(
Target.TADA.CharacteristicName = ifelse(TADA.CharacteristicName == "TEMPERATURE, WATER", "TEMPERATURE", TADA.CharacteristicName),
Target.TADA.ResultSampleFractionText = ifelse(TADA.CharacteristicName == "PH", NA, Target.TADA.ResultSampleFractionText),
HarmonizationGroup = ifelse(TADA.CharacteristicName == "PH", "PH", HarmonizationGroup),
HarmonizationGroup = ifelse(Target.TADA.CharacteristicName == "TEMPERATURE", "TEMPERATURE",
HarmonizationGroup
)
)
Harmonized_data <- TADA_HarmonizeSynonyms(Analysis_data,
ref = UniqueHarmonizationRef_edit
)
Total Nitrogen and Total Phosphorus Calculations
TADA_CalculateTotalNP
uses the Nutrient
Aggregation logic to add together specific subspecies to obtain a
total. TADA adds one more equation to the mix: total particulate
nitrogen + total dissolved nitrogen. The function uses as many
subspecies as possible to calculate a total for each given site, date,
and depth group, but it will estimate total nitrogen with whatever
subspecies are present. This function creates NEW total nutrient
measurements (total nitrogen unfiltered as N and total phosphorus
unfiltered as P) and adds them to the dataframe.
Users can use the default summation worksheet (see
TADA_GetNutrientSummationRef
) or customize it to suit their
needs. The function also requires a daily aggregation value, either
minimum, maximum, or mean. The default is ‘max’, which means that if
multiple measurements of the same subspecies-fraction-speciation-unit
occur on the same day at the same site and depth, the function will pick
the maximum value to use in summation calculations. In this example, we
will use the default summation worksheet to obtain total nitrogen.
Harmonized_data <- TADA_CalculateTotalNP(Harmonized_data, daily_agg = "max")
Parameter Level Filtering
Now we can filter to our three parameters of interest and use
TADA_Stats
take a look at some basic statistics for each,
including location count, measurement count, min, max, and more.
# review unique identifiers
unique(Harmonized_data$TADA.ComparableDataIdentifier)
# filter for three comparable data identifiers of interest
Filtered_data <- Harmonized_data %>%
dplyr::filter(TADA.ComparableDataIdentifier %in% c("TEMPERATURE_NA_NA_DEG C", "PH_NA_NA_STD UNITS", "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"))
# generate stats table
Filtered_data_stats <- TADA_Stats(Filtered_data)
DT::datatable(Filtered_data_stats, fillContainer = TRUE)
Standard Comparisons With Scatterplot and Pie Chart
Next, we will create a two characteristic scatterplot of pH and
temperature to begin visualizing the data, using
TADA_TwoCharacteristicScatterplot
.
# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(Harmonized_data, id_cols = "TADA.ComparableDataIdentifier", groups = c("TEMPERATURE_NA_NA_DEG C", "PH_NA_NA_STD UNITS"))
Let’s focus on just one characteristic, pH. We can filter the
dataframe to retain only pH samples. We can also add a column to
indicate whether each result falls within the water quality standard
range for pH (6.5 - 9). From this pH-only frame we can create a table
with just a few columns for easier review, and a single characteristic
scatterplot using TADA_Scatterplot
. In this example,
horizontal lines have been added to indicate the upper and lower ends of
the water quality standard. These are not a default option in
TADA_Scatterplot
but can be added via the
plotly
package.
# comparison to standard for ph
pH_Standard <- Filtered_data %>%
dplyr::filter(TADA.ComparableDataIdentifier == "PH_NA_NA_STD UNITS") %>%
dplyr::mutate(MeetsStandard = ifelse(TADA.ResultMeasureValue >= 6.5 & TADA.ResultMeasureValue <= 9, "Yes", "No"))
# subset of pH data with fewer rows
pH_Table <- pH_Standard %>%
dplyr::select(
MonitoringLocationIdentifier, OrganizationFormalName, ActivityStartDate, TADA.ResultMeasureValue,
MeetsStandard
)
DT::datatable(pH_Table, fillContainer = TRUE)
# pH scatterplot
pH_Scatter <- TADA_Scatterplot(pH_Standard, id_cols = "TADA.ComparableDataIdentifier") %>%
plotly::add_lines(
y = 6.5,
x = c(min(pH_Standard$ActivityStartDate), max(pH_Standard$ActivityStartDate)),
inherit = FALSE,
showlegend = FALSE,
line = list(color = "red"),
hoverinfo = "none"
) %>%
plotly::add_lines(
y = 9,
x = c(min(pH_Standard$ActivityStartDate), max(pH_Standard$ActivityStartDate)),
inherit = FALSE,
showlegend = FALSE,
line = list(color = "red"),
hoverinfo = "none"
)
pH_Scatter
Comparison of temperature to water quality standards is a little more complicated as the maximum temperature varies by season and there are multiple components of the temperature standard. For this example, we will focus on the “shall never exceed” seasonal temperature standards.
These are:
17.7 deg C for January, February, March, and December
33.7 deg C for April, May, June, July, August, September, October, and November.
We can use lubridate
and dplyr
functions to
assign the appropriate standard to each result by using
lubridate::month
to identify the month in which each sample
was collected and dplyr::mutate
to create a new column for
the temperature standard. Then the results can be compared to the
standard.
We can use TADA_Scatterplot
again to visualize the data.
Due to the seasonal variation in the standard and the ten-year date
range in our data set, we may want to consider other visualizations to
visually review the number of results not meeting the standard. We can
use TADA_FieldValuesPie
on the MeetsStandard field we
created to see how many samples met or did not meet the standard.
Creating a table also allows for easy filtering of results to identify the results that did not meet the standard and review during which years and seasons they occurred.
# comparison to standard for temperature
Temp_Standard <- Filtered_data %>%
dplyr::filter(TADA.ComparableDataIdentifier == "TEMPERATURE_NA_NA_DEG C") %>%
dplyr::mutate(
MonthForAnalysis = lubridate::month(ActivityStartDate),
TempStandard = ifelse(MonthForAnalysis %in% c(1, 2, 3, 12), 17.7, 33.7),
MeetsStandard = ifelse(TADA.ResultMeasureValue < TempStandard,
"Yes", "No"
)
)
# create scatterplot for temperature
Temp_Scatter <- TADA_Scatterplot(Temp_Standard, id_cols = "TADA.ComparableDataIdentifier")
Temp_Scatter
# create pie chart for temperature
Temp_Pie <- TADA_FieldValuesPie(Temp_Standard, field = "MeetsStandard")
Temp_Pie
# create subset table for temperature
Temp_Table <- Temp_Standard %>%
dplyr::select(
MonitoringLocationIdentifier, OrganizationFormalName, ActivityStartDate, TADA.ResultMeasureValue, TempStandard,
MeetsStandard
)
DT::datatable(Temp_Table, fillContainer = TRUE)
Additional Data Exploration
We may also want to use TADA functions for exploration and visualization of characteristics without water quality standards. For this example, we will use “TOTAL NITROGEN, MIXED FORMS_FILTERED, LAB_NA_MG/L”.
To better understand the distribution of results, we can use the TADA
functions TADA_Histogram
and TADA_Boxplot
.
TADA_Histogram
can be useful for identifying the overall
shape of the data.
# filter dataframe to comparable data identifier of interest
Nitrogen_data <- dplyr::filter(Filtered_data, TADA.ComparableDataIdentifier == "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L")
# generate a histogram
Nitrogen_Histogram <- TADA_Histogram(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")
# view histogram
Nitrogen_Histogram
TADA_Boxplot
can be useful for identifying skewness and
percentiles.
Nitrogen_Boxplot <- TADA_Boxplot(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")
Nitrogen_Boxplot
Filtering the results of TADA_Stats
for only “TOTAL
NITROGEN, MIXED FORMS_TOTAL RECOVERABLE_NA_MG/L” and selecting a smaller
subset of columns or creating a single characteristic scatterplot may
also provide useful information.
# create table with nitrogen stats
Nitrogen_stats <- Filtered_data_stats %>%
dplyr::filter(TADA.ComparableDataIdentifier == "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L") %>%
dplyr::select(Location_Count, Measurement_Count, Min, Max, Mean)
DT::datatable(Nitrogen_stats, fillContainer = TRUE)
# create nitrogen scatterplot
Nitrogen_Scatterplot <- TADA_Scatterplot(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")
Nitrogen_Scatterplot
Reproducible and Documented
Major benefits to this type are workflow are that it is reproducible and the decisions at each step are well documented. This means that it is easy to go back and review every step, understand the decisions that were made, make changes as necessary, and run it again. If someone asks “How were data filtered prior to analysis?” or “How did you identify additional Monitoring Locations within the Assessment Unit?”, you can refer to the code and function documentation to provide answers.
For example, we could change the Assessment Unit of interest, modify
the relevant code chunks (potentially including the WQP query if the new
Assessment Unit is in a different HUC or state) and repeat the same
analysis for the same characteristics at a different location. We could
also modify the harmonization reference to group additional comparable
data (if appropriate) or add a step to run
TADA_ConvertResultUnits
an additional time outside of
TADA_AutoClean
if we wanted to convert temperature results
to deg F for analysis.
Or we could change or add additional characteristics to our analysis. We could even write additional custom functions or even use functions from other packages as needed for more complex standards, evaluate trends, or answer other questions using a TADA dataframe.
The code chunk below displays the elapsed time it took to run the code chunks and create this document, providing an example of how incorporating TADA in your workflow can increase efficiency.
end.time <- Sys.time()
end.time - start.time