TADA R Package Training: A Markdown for Shepherdstown
TADA Team
2024-03-29
Source:vignettes/TADATrainingShepherdstown.Rmd
TADATrainingShepherdstown.Rmd
Welcome!
Thank you for your interest in Tools for Automated Data Analysis (TADA). TADA is an open-source tool set built in the R programming language and available for anyone to download and edit to their specific needs. This RMarkdown document walks users through how to download the TADA R package from GitHub, access and parameterize several important functions with a sample dataset, and create basic visualizations. The workflow is similar to a funnel: at each decision point, data that fail QC checks are removed from the core dataset and placed in a separate dataframe, while data that pass are carried into the next step. At the end of the QC checks, the user should be confident that their data are properly documented and applicable to the analysis at hand.
Note: TADA is still under development. New functionality is added weekly, and sometimes we need to make bug fixes in response to tester and user feedback. We appreciate your feedback, patience, and interest in these helpful tools.
Customize or contribute
TADA is housed in a repository on GitHub. Users desiring to review the base code and customize the package for their own purposes may:
Clone the repository using Git
Open the repository using GitHub Desktop, or
Download a zip file of the repository to their desktop.
Interested in contributing to the TADA package? The TADA team highly encourages input and development from users. Check out the Contributing page on the TADA GitHub site for guidance on collaboration conventions.
Install and setup
Users can install the TADA package from GitHub into their R library
using the remotes
package. Copy and paste the code below
into your R or RStudio console to download and install.
TADA package relies on other packages, therefore you may be prompted in the console to update dependency packages that have more recent versions available. If you see this prompt, it is recommended to update all of them (enter 1 into the console).
# Install TADA
if (!"remotes" %in% installed.packages()) {
install.packages("remotes")
}
remotes::install_github("USEPA/TADA", ref = "develop")
library(TADA)
It’s that easy! The most stable branch for TADA right now is the
develop branch. Contributors generally create their own branches based
on develop, make some improvements, and then submit a pull request to be
reviewed by the TADA Team. Once approved, updates are then merged into
the develop branch. However, you are welcome to download any branch
you’d like using the ref
input in
install_github
(see code chunk above). This functionality
is mainly only useful to TADA package developers/contributors.
The following code block ensures the additional packages needed to
run the code in this RMarkdown document are loaded. However, users may
also use the package name:: package function
notation to
avoid the list of library()
calls.
list.of.packages <- c("tidyverse")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
library(tidyverse)
Help pages
All TADA R package functions have their own individual help pages,
listed on the Function
reference page on the GitHub site. Users can also access the help
page for a given function in R or RStudio using the following format
(example below): ?TADA::[name of TADA function]
?TADA::TADA_DataRetrieval
Upload data
Now let’s start using the TADA R package functions. The first step is to bring a dataset into the R environment. TADA is designed to work with Water Quality Portal (WQP) data. This means that all of its functions will look for WQP column names and create new TADA-specific columns based on these elements. Users may upload their own custom dataset into R for use with TADA by ensuring their column names and data formats (e.g. numeric, character) align with WQP profiles.
If you are interested in reviewing the column headers and formats
required to run TADA, use the function below, which saves an example
spreadsheet to the user’s working directory. You can also take a look at
an example dataset, like TADA::Data_Nutrients_UT
to get an
idea of the data structure and format.
getwd() # find your working directory
TADA::TADA_GetTemplate() # download template to working directory
# uncomment below to review example dataset
# Data_Nutrients_UT <- TADA::Data_Nutrients_UT
TADA_DataRetrieval
is built upon USGS’s
readWQPdata
function within the dataRetrieval package,
which uses web service calls to bring WQP data into the R environment.
Additionally, TADA_DataRetrieval
performs some basic
quality control checks on the data using new TADA-specific columns to
preserve the original dataset:
Converts key character columns to ALL CAPS for easier harmonization and validation.
Removes complete duplicate results.
Identifies different classes of result values (numeric, text, percentage, comma-separated numeric, greater than/less than, numbers preceded by a tilde, etc.) and converts values to numeric where feasible.
Fills NA result values and units with detection limit values and units where the detection condition text indicates a censored data value. Identifies censored data results and categorizes them as over-detect, non-detect, or other. These data preprocessing features facilitate more complex detection limit value handling using other TADA functions (more details to come later in this training).
Unifies result and depth units to common units to improve ease of data harmonization. See
?TADA_ConvertResultUnits
and?TADA_ConvertDepthUnits
for more information on these processes.
Let’s give it a try. TADA_DataRetrieval
follows similar
parameterization to readWQPdata
, but check out the help
page or enter ?TADA::TADA_DataRetrieval into the console for more
information about input parameters and to see several examples.
# download data for NCTC's HUC12
# dataset_0 = TADA::TADA_DataRetrieval(
# startDate = "2020-06-22",
# endDate = "null",
# countycode = "null",
# huc = "02070004",
# siteid = "null",
# siteType = "null",
# characteristicName = "null",
# characteristicType = "null",
# sampleMedia = "null",
# statecode = "null",
# organization = "null",
# project = "null",
# applyautoclean = TRUE
# )
# For brevity, we'll skip pinging the WQP and instead load the example dataset:
dataset_0 <- TADA::Data_NCTCShepherdstown_HUC12
# Let's take a look at all of the TADA-created columns:
names(dataset_0)[grepl("TADA.", names(dataset_0))]
Currently, the TADA_DataRetrieval
function combines
three WQP data profiles: Sample Results (Physical/Chemical), Site data,
and Project data. This ensures that all important quality control
columns are included in the dataset.
Note: USGS and EPA are working together to create
WQP 3.0 data profiles. Once released, one data profile will contain the
columns critical to TADA, removing the need to combine profiles in this
first step. TADA package users likely will not notice a difference in
their usage of the TADA_DataRetrieval
function, but it will
simplify the steps needed to upload a custom or WQP GUI-downloaded
dataset into the R package.
Initial data review
Now that we’ve pulled the data into the R session, let’s take a look at it. Note that any column names with a leading “TADA.” were generated from the TADA_DataRetrieval function.
First, always good to take a look at the data frame dimensions.
Question 1: What are the dimensions of your dataset?
dim(dataset_0) # returns x and of x (as the numbers of rows and columns respectively)
Before we start filtering and flagging our data, let’s create a function (dimCheck) that performs dimension checks between the results that pass each filter or QC flag check (and are retained) and those that do not (and are removed). These dimension checks ensure that the total number of rows in the original input dataset (all_result_num) equal the the total number of rows added up between the passing (pass_data) and removed (fail_data) data frames.
# defining a dimension check function that compares removed and retained data dimensions against the initial data input
dimCheck <- function(all_result_num, pass_data, fail_data, checkName) {
# check result numbers after split
final_result_num <- dim(pass_data)[1] + dim(fail_data)[1]
# always good to do a dimension check
if (!all_result_num == final_result_num) {
print(paste0("Help! Results do not add up between dataset and removed after ", checkName, " check."))
} else {
print(paste0("Good to go. Zero results created or destroyed in ", checkName, " check."))
}
}
# let's first get the total number of rows in the dataset.
all_result_num <- dim(dataset_0)[1]
Next, we can use the TADA TADA_FieldCounts()
function to
see how many unique values are contained within each column of the
dataset. The function can either return all column counts, most, or just
the key columns. We’ll try the input with display = "key"
and display = "all"
. Enter ?TADA::TADA_FieldCounts() into
the console for more information on this function.
Question 2: Which column should have a unique value in every row and why?
TADA::TADA_FieldCounts(dataset_0, display = "key")
TADA::TADA_FieldCounts(dataset_0, display = "all")
Question 3: How many unique ‘TADA.ActivityMediaName’ values exist in your dataset? Are there any media types that are not water?
TADA is currently designed to accommodate water data from the WQP. Let’s ensure that we remove all non-water data first.
# remove data with media type that is not water
removed <- dataset_0 %>%
dplyr::filter(!TADA.ActivityMediaName %in% c("WATER")) %>%
dplyr::mutate(TADA.RemovalReason = "Activity media is not water.")
# what other media types exist in dataset?
unique(removed$TADA.ActivityMediaName)
# clean dataset contains only water
dataset <- dataset_0 %>% dplyr::filter(TADA.ActivityMediaName %in% c("WATER"))
dimCheck(all_result_num, dataset, removed, checkName = "Activity Media")
Two additional helper functions one can use at any step in the
process are TADA_FieldValuesTable()
and
TADA_FieldValuesPie()
. These functions create a summary
table and pie chart (respectively) of all the unique values in a given
column. Let’s give it a try on ActivityTypeCode, which is a WQP column
describing the type of sample collected for each result.
TADA::TADA_FieldValuesTable(dataset, field = "OrganizationFormalName")
TADA::TADA_FieldValuesPie(dataset, field = "OrganizationFormalName")
Question 4: When might a user choose to view a column’s unique values as a table rather than in a pie chart?
We can take a quick look at some of the TADA-created columns that
review result value types. Because TADA is intended to work with numeric
data, at this point, it would be good to remove those result values that
are NA without any detection limit info, or contain text or special
characters that cannot be converted to numeric. Note that TADA will fill
in missing values with detection limit values and units if the
ResultDetectionConditionText and DetectionQuantitationLimitType fields
are populated. Use ?TADA_ConvertSpecialChars
for more
details on result value types and handling.
# take a look at datatypes
TADA::TADA_FieldValuesTable(dataset, field = "TADA.ResultMeasureValueDataTypes.Flag")
# Numeric or numeric-coerced data types
rv_datatypes <- unique(subset(dataset, !is.na(dataset$TADA.ResultMeasureValue))$TADA.ResultMeasureValueDataTypes.Flag)
rv_datatypes
# Non-numeric data types coerced to NA
na_rv_datatypes <- unique(subset(dataset, is.na(dataset$TADA.ResultMeasureValue))$TADA.ResultMeasureValueDataTypes.Flag)
na_rv_datatypes
# these are all of the NOT allowable data types in the dataset.
incompatible_datatype <- dataset %>%
filter(!TADA.ResultMeasureValueDataTypes.Flag %in% c("Numeric", "Less Than", "Greater Than", "Approximate Value", "Percentage", "Comma-Separated Numeric", "Numeric Range - Averaged", "Result Value/Unit Copied from Detection Limit")) %>%
dplyr::mutate(TADA.RemovalReason = "Result value type cannot be converted to numeric or no detection limit values provided.")
# take a look at the difficult data types - do they make sense?
check <- unique(incompatible_datatype[, c("TADA.CharacteristicName", "ResultMeasureValue", "TADA.ResultMeasureValue", "ResultMeasure.MeasureUnitCode", "TADA.ResultMeasure.MeasureUnitCode", "TADA.ResultMeasureValueDataTypes.Flag", "DetectionQuantitationLimitMeasure.MeasureValue", "TADA.DetectionQuantitationLimitMeasure.MeasureValue", "DetectionQuantitationLimitMeasure.MeasureUnitCode", "TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode")])
View(check)
dataset <- dataset %>% filter(TADA.ResultMeasureValueDataTypes.Flag %in% c("Numeric", "Less Than", "Greater Than", "Approximate Value", "Percentage", "Comma-Separated Numeric", "Numeric Range - Averaged", "Result Value/Unit Copied from Detection Limit"))
removed <- plyr::rbind.fill(removed, incompatible_datatype)
rm(incompatible_datatype, check, rv_datatypes, na_rv_datatypes)
dimCheck(all_result_num, dataset, removed, checkName = "Result Format")
During TADA_DataRetrieval
, TADA automatically sorts
result values into detection limit categories (e.g. non-detect,
over-detect) based on populated values in the
ResultDetectionConditionText and DetectionQuantitationLimitTypeName
columns. You can find the reference tables used to make these decisions
in TADA_GetDetCondRef()
and
TADA_GetDetLimitRef()
functions. In some cases, results are
missing detection limit/condition info, or there is a conflict in the
detection limit and condition. The user may want to remove problematic
detection limit data before proceeding.
TADA::TADA_FieldValuesPie(dataset, field = "TADA.CensoredData.Flag")
problem_censored <- dataset %>%
dplyr::filter(!TADA.CensoredData.Flag %in% c("Non-Detect", "Over-Detect", "Other", "Uncensored")) %>%
dplyr::mutate(TADA.RemovalReason = "Detection limit information contains errors or missing information.")
# Let's take a look at the problematic data that we filtered out (if any)
check <- unique(problem_censored[, c("TADA.CharacteristicName", "ResultDetectionConditionText", "DetectionQuantitationLimitTypeName", "TADA.CensoredData.Flag")])
View(check)
Data flagging
We’ve taken a quick look at the raw dataset and split off some data
that are not compatible with TADA, now let’s run through some quality
control checks. The most important ones to run to ensure your dataset is
ready for subsequent steps are TADA_FlagFraction()
,
TADA_FlagSpeciation()
, TADA_FlagResultUnit()
,
and TADA_FindQCActivities()
. With the exception of
TADA_FindQCActivities()
, these flagging functions leverage
WQX’s QAQC
Validation Table. TADA_FindQCActivities()
uses a
TADA-specific domain table users can review with
TADA_GetActivityTypeRef()
. All QAQC tables are frequently
updated in the package to ensure they match the latest version on the
web. You can find guidance for using the WQX QAQC Validation Tables in
this vignette
on the TADA GitHub site.
Bring the QAQC Validation Table into your R session to view or save with the following command:
qaqc_ref <- TADA::TADA_GetWQXCharValRef()
View(qaqc_ref)
unique(qaqc_ref$Type)
Question 5: What do you think the qaqc_ref$Type
column indicates?
TADA joins this validation table to the data and uses the “Valid” and “Invalid” labels in the Status column to create easily understandable flagging columns for each function. Let’s run these four flagging functions.
dataset_flags <- TADA::TADA_FlagFraction(dataset, clean = FALSE, flaggedonly = FALSE)
dataset_flags <- TADA::TADA_FlagSpeciation(dataset_flags, clean = "none", flaggedonly = FALSE)
dataset_flags <- TADA::TADA_FlagResultUnit(dataset_flags, clean = "none", flaggedonly = FALSE)
dataset_flags <- TADA::TADA_FindQCActivities(dataset_flags, clean = FALSE, flaggedonly = FALSE)
dimCheck(all_result_num, dataset_flags, removed, checkName = "Run Flag Functions")
Question 6: Did any warnings or messages appear in the console after running these flagging functions? What do they say?
Now that we’ve run all the key flagging functions, let’s take a look at the results and make some decisions.
TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.SampleFraction.Flag")
TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.MethodSpeciation.Flag")
TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.ResultUnit.Flag")
TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.ActivityType.Flag")
Any results flagged as “Invalid” are recognized in the QAQC Validation Table as having some data quality issue. “NonStandardized” means that the format has not been fully vetted or processed, while “Valid” confirms that the characteristic combination is widely recognized as correctly formatted. Let’s add any invalid results to the removed dataset for later review.
Note: if you find any errors in the QAQC Validation Table, please contact the WQX Help Desk at WQX@epa.gov to help correct it. Thanks in advance!
# grab all the flagged results from the four functions
problem_flagged <- dataset_flags %>%
filter(TADA.SampleFraction.Flag == "Invalid" | TADA.MethodSpeciation.Flag == "Invalid" | TADA.ResultUnit.Flag == "Invalid" | !TADA.ActivityType.Flag %in% ("Non_QC")) %>%
dplyr::mutate(TADA.RemovalReason = "Invalid Unit, Method, Speciation, or Activity Type.")
dataset_flags <- dataset_flags %>% dplyr::filter(!ResultIdentifier %in% problem_flagged$ResultIdentifier)
removed <- plyr::rbind.fill(removed, problem_flagged)
rm(problem_flagged)
dimCheck(all_result_num, dataset_flags, removed, checkName = "Filter Flag Functions")
Question 7: Are there any other metadata columns that you review and filter in your workflow?
We’ve finished running the recommended flagging functions and removing results that do not pass QC checks. Let’s look at the breakdown of these data in the removed object.
TADA::TADA_FieldValuesTable(removed, field = "TADA.RemovalReason")
You can review any other columns of interest and create custom domain tables of your “Valid” and “Invalid” criteria using R or Excel. Also check out some of the other flagging functions available in TADA:
?TADA_FlagAboveThreshold()
?TADA_FlagBelowThreshold()
?TADA_FindContinuousData()
?TADA_FlagCoordinates()
?TADA_FlagMethod()
?TADA_FindQAPPApproval()
?TADA_FindQAPPDoc()
?TADA_FindPotentialDuplicates()
- in development
Please let us know of other flagging functions you think would have broad appeal in the TADA package or need assistance brainstorming/developing.
Censored data handling
We have already identified, flagged, and in some cases removed problematic detection limit data from our dataset, but that doesn’t keep them from being difficult. Because we do not know the result value with adequate precision, water quality data users often set non-detect values to some number below the reported detection limit. TADA contains some simple methods for handling detection limits: users may multiply the detection limit by some number between 0 and 1, or convert the detection limit value to a random number between 0 and the detection limit. More complex detection limit estimation requiring regression models (Maximum Likelihood, Kaplan-Meier, Robust Regression on Order Statistics) or similar must be performed outside of the current version of TADA (though future development is planned).
dataset_cens <- TADA::TADA_SimpleCensoredMethods(dataset_flags, nd_method = "multiplier", nd_multiplier = 0.5, od_method = "as-is")
Question 8: How would you parameterize
TADA_SimpleCensoredMethods()
to make non-detect values a
random number between 0 and the provided detection limit?
Let’s take a look at how the censored data handling function affects
the TADA.ResultMeasureValueDataTypes.Flag
column.
# before
TADA::TADA_FieldValuesTable(dataset_flags, field = "TADA.ResultMeasureValueDataTypes.Flag")
# after
TADA::TADA_FieldValuesTable(dataset_cens, field = "TADA.ResultMeasureValueDataTypes.Flag")
Question 9: Is there a difference between the first and second tables?
If you’d like to start thinking about using statistical methods to
estimate detection limit values, check out the
?TADA_SummarizeCensoredData
function, which accepts
user-defined data groupings to prescribe estimation tests based on the
number of results, % of dataset censored, and number of censoring levels
(detection limits). The decision tree used in the function was outlined
in an National
Nonpoint Source Tech Memo.
Data exploration
How are you feeling about your test dataset? Does it seem ready for
the next step(s) in your analyses? There’s probably a lot more you’d
like to look at/filter out before you’re ready to say: QC complete.
Let’s first check out characteristics in the dataset using
dplyr
functions and pipes.
# display full table
nrow <- length(unique(dataset_cens$TADA.CharacteristicName))
# get table of characteristics with number of results, sites, and organizations
dataset_cens %>%
dplyr::group_by(TADA.CharacteristicName) %>%
dplyr::summarise(Result_Count = length(ResultIdentifier), Site_Count = length(unique(MonitoringLocationIdentifier)), Org_Count = length(unique(OrganizationIdentifier))) %>%
dplyr::arrange(desc(Result_Count)) %>%
print(n = nrow)
You may see a characteristic that you’d like to investigate further
in isolation. TADA_FieldValuesPie()
will also produce
summary pie charts for a given column within a specific
characteristic. Let’s take a look.
# go ahead and pick a characteristic name from the table generated above. I picked nitrogen, mixed forms
TADA::TADA_FieldValuesPie(dataset_cens, field = "TADA.ResultSampleFractionText", characteristicName = "NITROGEN, MIXED FORMS (NH3), (NH4), ORGANIC, (NO2) AND (NO3)")
We can view the site locations using a TADA mapping function. In this map, the circles indicate monitoring locations in the dataset; their size corresponds to the number of results collected at that site, while the darker the circle, the more characteristics were sampled at that site.
TADA::TADA_OverviewMap(dataset_cens)
Out of curiosity, let’s take a look at a breakdown of these monitoring location types. Do they all indicate surface water samples? Depending upon your program’s goals and methods, you might want to filter out some of the types you see.
TADA::TADA_FieldValuesPie(dataset_cens, field = "MonitoringLocationTypeName")
One of the next big steps is data harmonization: translating and
aggregating synonyms, combining multiple forms/species of certain
characteristics, etc. We won’t get to that in this demo (more coming
soon in TADA!), but we can start looking at data distributions within a
single characteristic-speciation-fraction-unit using the plotting
functions TADA_Histogram()
and TADA_Boxplot()
.
We can also view a stats table using TADA_Stats
.
Let’s first take a look at the column TADA.ComparableDataIdentifier, which breaks down characteristics into groups by name, fraction, speciation, and unit. These four columns are important to evaluate together to ensure the plotted data are sufficiently similar to appear on a single plot together: it doesn’t make sense to plot characteristics with different units or fractions in the same distribution.
# trusty field values table - lets just look at the first few entries with the most associated records
View(TADA::TADA_FieldValuesTable(dataset_cens, field = "TADA.ComparableDataIdentifier"))
Now that we have an idea for what the TADA.ComparableDataIdentifier looks like, we can check out how it is used to plot distinct characteristic groups.
# Look at a histogram, boxplot, and stats for TADA.ComparableDataIdentifier(s) of your choice.
comp_data_id <- "TEMPERATURE, WATER_NA_NA_DEG C"
plot_data <- subset(dataset_cens, dataset_cens$TADA.ComparableDataIdentifier %in% comp_data_id)
TADA::TADA_Histogram(plot_data, id_cols = "TADA.ComparableDataIdentifier")
TADA::TADA_Boxplot(plot_data, id_cols = "TADA.ComparableDataIdentifier")
View(TADA::TADA_Stats(plot_data))
Finally, we can download our PASS and FAIL datasets together into an Excel spreadsheet.
dataset_and_removed <- dplyr::bind_rows(dataset_cens, removed)
# Un-comment to download Excel spreadsheet to your working directory
# install.packages(writexl)
# library(writexl)
# writexl::write_xlsx(dataset_and_removed, "NCTCShepherdstownData.xlsx")
TADA R Shiny Modules
Finally, take a look at an alternative workflow for QC’ing WQP data: TADA Shiny Module 1: Data Discovery and Cleaning. This is a Shiny application that runs many of the TADA functions covered in this training document behind a graphical user interface. The shiny application queries the WQP, contains maps and data visualizations, flags suspect data results, handles censored data, and more. You can launch it using the code below.
# download TADA Shiny repository
remotes::install_github("USEPA/TADAShiny", ref = "develop", dependencies = TRUE)
# launch the app locally.
TADAShiny::run_app()
DRAFT Module 1 is also currently hosted on the web with minimal server memory/storage allocated.