Skip to contents

Overview

We will walk through the specific use case of query, downloading, preparing, and QA/QCing data from the San Juan and Animas Rivers near Farmington, New Mexico with a focus on total dissolved solids and conductivity.

This vignette is designed to be concise and provide an example of how you might use TADA functions for basic data discovery and preperation. It leverages functions from TADA Modules 1 and 2. For more details on each function, see the TADA Function Reference or the more comprehensive module vignettes:

Load Required Packages

The fist step is to load the TADA package and other packages used in our workflow. (If you are a first-time TADA user, see TADA Module 1 Beginner Training for TADA package installation instructions.)

We’ll also record the start time for our analysis, beginning after loading packages.

# Load required packages
library(EPATADA)
library(dplyr)
library(yaml)
library(lwgeom)
library(sf)
library(lubridate)
library(knitr)
library(DT)


# Record start time
start.time <- Sys.time()

How’s My Waterway

To help define the area of interest, we can start by taking a look at How’s My Waterway by searching for Farmington, New Mexico: HMW_Farmington, NM. We can identify the Assessment Unit Identifiers for the segments of the San Juan and Animas Rivers closest to their confluence. These are:

1) San Juan River (Navajo bnd at Hogback to Animas River) - NM-2401_10

2) Animas River (San Juan River to Estes Arroyo) - NM-2403.A_00

We can also identify the HUC-12s for each segment. This information will help us build our Water Quality Portal query to find results for the segements of interest plus other nearby waterbodies. The HUC-12s are 140801050505 (San Juan River) and 140801041006 (Animas River).

Data Retrieval - Water Chemistry

The water chemistry data we will use are downloaded from the Water Quality Portal (https://www.waterqualitydata.us/). The WQP integrates publicly available water quality data from the USGS National Water Information System (NWIS) and the EPA Water Quality Exchange (WQX) Data Warehouse. The EPA water quality data originate from the Water Quality Exchange, the EPA’s repository of water quality monitoring data collected by water resource management groups across the country. Organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit data to the WQX.

One of the fields that can be used to query the WQP is HUC. As we know the HUC-12s of interest, we will use them to retrieve WQP data using TADA_DataRetrieval. As we are most interested in total dissovled solids and conductivity and these are in the “Physical” characteristic type group in the WQP, we will also include characteristic group type in our query. We will not set start and end dates as we are interested in all available data for these locations and characteristics. .

# Import data from WQP

data <- TADA_DataRetrieval(
  statecode = "NM",
  huc = c("140801050505", "140801041006"),
  characteristicType = "Physical",
  applyautoclean = TRUE
)
## [1] "Downloading WQP query results. This may take some time depending upon the query size."
## $statecode
## [1] "US:35"
## 
## $huc
## [1] "140801050505" "140801041006"
## 
## $characteristicType
## [1] "Physical"
## 
## [1] "Data successfully downloaded. Running TADA_AutoClean function."
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "No deprecated characteristic names found in dataset."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "NOTE: Dataset contains 7451 USGS results with speciation information in both the result unit and method speciation columns. This function overwrites the TADA method speciation column with the speciation provided in the result unit column."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"

AutoClean

As part of TADA_DataRetrieval, TADA_AutoClean is performed automatically unless specified otherwise by the user (by setting applyautoclean = FALSE). When TADA_AutoClean is run, the following functions are performed on the data retrieved from the WQP:

  • TADA_ConvertSpecialChars - converts result value columns to numeric and flags non-numeric values that could not be converted.

  • TADA_ConvertResultUnits - unifies result units for easier quality control and review

  • TADA_ConvertDepthUnits - converts depth units to a consistent unit (meters).

  • TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.

  • Other helpful actions - converts important text columns to all upper-case letters, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)

As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” This allows users to easily review any changes TADA functions have made.

Review and Filter Results By Comparable Data Identifier

Now we can review the results and decide which ones we would like to retain for further QC and analysis. First let’s take a look at all unique values in the TADA.ComparableDataIdentifier column and see how how many results are associated with each. TADA.ComparableDataIdentifier concatenates TADA.CharacteristicName, TADA.ResultSampleFractionText, TADA.MethodSpeciationName, and TADA.ResultMeasure.MeasureUnitCode.

# use TADA_FieldValuesTable to create a table of the number of results per TADA.ComparableDataIdentifier
compid_table <- TADA_FieldValuesTable(data, field = "TADA.ComparableDataIdentifier")

DT::datatable(compid_table, fillContainer = TRUE)

As we scroll through the table, we can see that there are some repeated characteristic names. It may be possible that some of these are synonyms which can be harmonized using TADA_HarmonizeSynonyms so their results can be directly compared. However, for purposes of this demo we will not explore harmonization and will instead focus on the two TADA.ComparableDataIdentifiers with the greatest number of results, “TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L” and “SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C”. For information on harmonization, see the EPATADA Function Reference.

Now, we can filter the data set to retain only the TADA.ComparableDataIdentifiers of interest.

data <- data %>%
  dplyr::filter(TADA.ComparableDataIdentifier %in% c("TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L", "SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C"))

Water Chemistry Data Preparation and QC

Before performing any analysis, there are some additional data preparation and QC steps that can be performed using TADA functions.

First, we can check the data for additional conditions which might cause us to exclude certain results from further analysis. For example, we can use TADA_FlagMethod to check for invalid analytical method-characteristic combinations; TADA_FlagSpeciation to check for invalid characteristics-method speciation combinations, TADA_FlagResultUnit to check the the validity of each characteristic-media-result unit combination, and TADA_FlagFraction to check the validity of characteristic-fraction combinations. In this simple example, we will remove all invalid combinations identified by each of these functions. When setting up your own workflow, you will want to review each function and its output more carefully before moving on.

data <- data %>%   TADA_FlagMethod(clean = TRUE) %>%   
  TADA_FlagSpeciation(clean = "both") %>%
  TADA_FlagResultUnit(clean = "both") %>% 
  TADA_FlagFraction(clean = TRUE)
## [1] "No invalid method/characteristic combinations in your dataframe. Returning the input dataframe with TADA.AnalyticalMethod.Flag column for tracking."
## [1] "All characteristic/fraction combinations are valid in your dataframe. Returning input dataframe with TADA.SampleFraction.Flag column for tracking."

Another set of TADA flagging functions, TADA_FlagAboveThreshold and TADA_FlagBelowThreshold, can be used to check results against national lower and upper thresholds. As in the previous examples, we will set these functions to remove any results that fall outside the national thresholds.

data <- data %>%   
  TADA_FlagAboveThreshold(clean = TRUE) %>%
  TADA_FlagBelowThreshold(clean = TRUE)
## [1] "No data below the WQX Lower Threshold were found in your dataframe. Returning the input dataframe with TADA.ResultValueBelowLowerThreshold.Flag column for tracking."

We can also review the data set to see if it contains potential duplicate results from within a single organization or potential duplicates from within multiple organizations (such as when two or more organizations monitor the same location and may submit duplicate results).

We will select to keep only unique samples from TADA_FindPotentialDuplicatesSingleOrg by filtering for TADA.SingleOrgDup.Flag equals “Unique”. Then we will search for potential duplicates from multiple orgs with TADA_FindPotentialDuplicatesMultipleOrgs and filter only to retain only one value per potential duplicate by filtering for TADA.ResultSelectedMultipleOrgs equals “Y”.

If you would like to prioritize results from one organization over another, this can be done using the org_hierarchy argument in TADA_FindPotentialDuplicatesMultipleOrgs. Information on how to this can be found in the function reference for TADA_FindPotentialDuplicatesMultipleOrgs. In this example, we will use the default setting “none” where a single representative is selected randomly from each duplicate group.

data <- data %>%   
  TADA_FindPotentialDuplicatesSingleOrg() %>%
  dplyr::filter(TADA.SingleOrgDup.Flag == "Unique") %>% 
  TADA_FindPotentialDuplicatesMultipleOrgs(dist_buffer = 100,   
                                           org_hierarchy = "none") %>% dplyr::filter(TADA.ResultSelectedMultipleOrgs == "Y")
## [1] "39 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
## [1] "No nearby sites detected using input buffer distance."
## [1] "No duplicate results detected. Returning input dataframe with duplicate flagging columns set to 'N'."

Next we can remove QC and qualified samples (with qualifiers identified as suspect) using TADA_FindQCActivities and TADA_FlagMeasureQualifierCode, respectively. See the TADA Module 1 vignettes for more information on how these functions identify and flag results for removal as you may wish to make different decisions than the TADA defaults regarding which results to retain.

data <- data %>% 
  TADA_FindQCActivities(clean = TRUE) %>%   
  TADA_FlagMeasureQualifierCode(clean = TRUE)
## [1] "ActivityTypeCode column in dataset contains value(s) Quality Control Sample-Spike Solution which is/are not in the ActivityType WQX domain table. These data records are USGS only values and placed under the TADA.ActivityType.Flag: 'Not Reviewed'. Please review these carefully to detemine data usability."
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."

Censored Data

We can use a TADA function to identify any censored data in the data set.

data <- TADA_IDCensoredData(data)
## [1] "TADA_IDCensoredData: No censored data detected in your data frame. Returning input dataframe with new column TADA.CensoredData.Flag set to Uncensored"

Because there are no censored data in the data set, we do not need to take any further action. If censored data do exist, TADA_SimpleCensoredMethods could be used. See the TADA Module 1 vignettes or for more information.

Data Retrieval - Geospatial

All geospatial data used in this example are downloaded from the Assessment Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) (https://www.epa.gov/waterdata/attains). ATTAINS is an online system for accessing information about conditions in the Nation’s surface waters.

This information reported to EPA by states is available in ATTAINS. The public information is made available via ATTAINS web services. New TADA functions leverage the ATTAINS geospatial services to make geospatial information including catchment and assessment unit geometry easily accessible to R users.

We can use the function, TADA_GetATTAINS to obtain geospatial data from ATTAINS relevant to the Monitoring Locations included in the WQP data set. See TADA Module 2 for a much more detailed look at the logic behind this and the other TADA geospatial functions.

# Import data from ATTAINS geospatial services

ATTAINS_data <- TADA_GetATTAINS(data)

View Geospatial Features

The TADA_ViewATTAINS function allows us to see where the monitoring locations from the WQP data set are relative to ATTAINS-indexed catchments and assessment units. This can be helpful when deciding which Monitoring Locations should be retained for additional analysis. It can also allow us to sub

# View catchments and assessment units on map

ATTAINS_map <- TADA_ViewATTAINS(ATTAINS_data)

ATTAINS_map

Monitoring Location Filter and Review

Now that we’ve associated geospatial data from ATTAINS with the WQP data, we can generate pie charts for each TADA.ComparableDataIdentifier to give us some information about how many results are available per each monitoring location group.

# Select TADA data with ATTAINS data
data <- ATTAINS_data$TADA_with_ATTAINS %>%
  # Remove geometry to reduce size of data set
  sf::st_drop_geometry() %>%
  # Assing "Other" as name for unnamed assessment units
  dplyr::mutate(ATTAINS.assessmentunitname = 
                  ifelse(is.na(ATTAINS.assessmentunitname),
                         "Other", ATTAINS.assessmentunitname))

# Create pie chart of results for each AUID group for total dissolved solids

TADA_FieldValuesPie(data, field = "ATTAINS.assessmentunitname",
                    characteristicName = "TOTAL DISSOLVED SOLIDS")

# Create pie chart of results for each AUID group for specific conductance

TADA_FieldValuesPie(data, field = "ATTAINS.assessmentunitname",
                    characteristicName = "SPECIFIC CONDUCTANCE")

We can also create a table to give us a little more information about assessment unit each monitoring location belongs to, the number of results per each parameter and the oldest and most recent sampling date.

# Create table of monitoring location identifiers
MonitoringLocations <- data %>%
  dplyr::select(MonitoringLocationName, MonitoringLocationIdentifier, OrganizationFormalName,
                ATTAINS.assessmentunitname, TADA.CharacteristicName, ActivityStartDate) %>%
  dplyr::group_by(MonitoringLocationIdentifier) %>%
  dplyr::mutate(n_TDS = length(TADA.CharacteristicName[TADA.CharacteristicName == "TOTAL DISSOLVED SOLIDS"]), 
                n_SpecificConductance = length(TADA.CharacteristicName[TADA.CharacteristicName == "SPECIFIC CONDUCTANCE"]),
                StartDate = min(ActivityStartDate),
                EndDate = max(ActivityStartDate)) %>%
  dplyr::select(MonitoringLocationIdentifier, 
                ATTAINS.assessmentunitname, 
                n_TDS,
                n_SpecificConductance,
                StartDate,
                EndDate) %>%
  dplyr::distinct()

DT::datatable(MonitoringLocations, fillContainer = TRUE)

Now, let’s create subsets of the data by assessment unit name. These will be useful to create some basic data visualizations.

data_animas <- data %>%
  dplyr::filter(ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)")

data_sanjuan <- data %>%
  dplyr::filter(ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)")

data_other <- data %>%
  dplyr::filter(ATTAINS.assessmentunitname == "Other")

Data Visualization

First, let’s create a two characteristic scatterplot of total dissolved solids and specific conductance from the entire data set to begin visualizing the data, using TADA_TwoCharacteristicScatterplot.

# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(data, id_cols = "TADA.ComparableDataIdentifier", groups = c("SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C", "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L"))

While there are some older records (including a specific conductance result from 1900), we may decide we are interested only in more recent data. We could filter the data set to exclude any results collected prior to 2015 and recreate the scatterplot.

# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(data %>% 
                                    dplyr::filter
                                  (ActivityStartDate > "2014-12-31"), 
                                  id_cols = "TADA.ComparableDataIdentifier", 
                                  groups = c("SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C", 
                                             "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L"))

We could also choose to explore one characteristic from one of the assessment units more closely. For example, we could create a histograms for both characteristics for the Animas River only, excluding samples prior to 2014.

TADA_Histogram(data %>% 
                 dplyr::filter(ActivityStartDate > "2014-12-31",
                               TADA.CharacteristicName == 
                                 "SPECIFIC CONDUCTANCE",
                               ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)"),
               id_cols = "TADA.ComparableDataIdentifier")

Or we could create a boxplot for a particular characteristic and assessment unit, for example, total dissolved solids for the San Juan River.

TADA_Boxplot(data %>% 
                 dplyr::filter(ActivityStartDate > "2014-12-31",
                               TADA.CharacteristicName == 
                                 "SPECIFIC CONDUCTANCE",
                               ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)"),
               id_cols = "TADA.ComparableDataIdentifier")

We might also want to directly compare the same characteristic overtime between the two rivers. We can do this by revisiting TADA.TwoCharacteristicScatterplot.

# create two characteristic scatterplot using TADA_TWoCharacteristicScatterplot
twochar_scatter <- TADA_TwoCharacteristicScatterplot(data %>% 
                 dplyr::filter(ActivityStartDate > "2014-12-31",
                               TADA.ComparableDataIdentifier == 
                                "SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C"),
               id_cols = "ATTAINS.assessmentunitname",
               groups = c("San Juan River (Navajo bnd at Hogback to Animas River)", "Animas River (San Juan River to Estes Arroyo)")) %>%
  # remove default plot features that are not applicable for a location comparison
  plotly::layout(yaxis2 = list(overlaying = "y", side = "right", title = "", visible = FALSE),
                 title = TADA_InsertBreaks("SPECIFIC CONDUCTANCE for the San Juan and Animas Rivers Over Time"))
## [1] "Note: TADA.ComparableDataIdentifier not found in id_cols argument and is highly recommended."
# modify plot titles to reflect group/location names
twochar_scatter <- plotly::plotly_build(twochar_scatter)
twochar_scatter$x$data[[2]]$name <- "San Juan River"
twochar_scatter$x$data[[3]]$name <- "Animas River"

# rebuild graph with corrected legend labels
plotly::plotly_build(twochar_scatter)

We can also use TADA_Stats to generate a summary table for the entire data set or subsets of it for each river.

# create table with all data
data_stats <- TADA_Stats(data)

DT::datatable(data_stats, fillContainer = TRUE)
# create table with animas data
animas_stats <- TADA_Stats(data %>% dplyr::filter(ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)"))

DT::datatable(animas_stats, fillContainer = TRUE)
# create table with san juan data
sanjuan_stats <- TADA_Stats(data %>% dplyr::filter(ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)"))

DT::datatable(sanjuan_stats, fillContainer = TRUE)

Reproducible and Documented

This workflow is reproducible and the decisions at each step are well documented. This means that it is easy to go back and review every step, understand the decisions that were made, make changes as necessary, and run it again.

For example, we could modify the code to repeat the same analysis at a different location. Or we could run TADA_ConvertResultUnits an additional time outside to convert units for one or more characteristics.

We could change or add additional characteristics to our analysis. We could even write additional custom functions or even use functions from other packages as needed for more complex standards, evaluate trends, or answer other questions using a TADA data frame.

The code chunk below displays the elapsed time it took to run the code chunks and create this document, providing an example of how incorporating TADA in your workflow can increase efficiency.

end.time <- Sys.time()

end.time - start.time
## Time difference of 14.03914 secs