Skip to contents

Overview

We will walk through the specific use case of query, downloading, preparing, QA/QCing data from a single Assessment Unit with multiple Monitoring Locations sampled by several organizations. Three characteristics, pH, water temperature, and total nitrogen are used in this example. Two of the characteristics, pH and water temperature, will be compared to a state water quality standard from Illinois. The third characteristic, total nitrogen, does not have an Illinois water quality standard, but is used to demonstrate some helpful functions for data exploration and visualization.

This vignette is designed to be concise and provide an example of how you might use TADA functions for basic data discovery and analysis with regards to a single Assessment Unit and a small subset of characteristics. It leverages functions from TADA Modules 1 and 2. For more details on each function, see the TADA Function Reference or the more comprehensive module vignettes:

Install and load packages

First, install and load the remotes package specifying the repo. This is needed before installing TADA because it is only available on GitHub (not CRAN).

install.packages("remotes",
  repos = "http://cran.us.r-project.org"
)
library(remotes)

Next, install and load EPATADA using the remotes package.

remotes::install_github("USEPA/EPATADA",
  ref = "develop",
  dependencies = TRUE
)

Load Required Packages

The fist step is to load the TADA package and other packages used in our workflow. (If you are a first-time TADA user, see TADA Module 1 Beginner Training for TADA package installation instructions.)

We’ll also record the start time for our analysis, beginning after loading packages.

list.of.packages <- c(
  "dplyr", "yaml", "lwgeom", "sf", "lubridate",
  "knitr", "DT"
)
new.packages <- list.of.packages[!(list.of.packages %in%
  installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)

# Load required packages
library(dplyr)
library(yaml)
library(lwgeom)
library(sf)
library(lubridate)
library(knitr)
library(DT)

# Record start time
start.time <- Sys.time()

Data Retrieval - Water Chemistry

The water chemistry data we will use are downloaded from the Water Quality Portal (https://www.waterqualitydata.us/). The WQP integrates publicly available water quality data from the USGS National Water Information System (NWIS) and the EPA Water Quality Exchange (WQX) Data Warehouse. The EPA water quality data originate from the Water Quality Exchange, the EPA’s repository of water quality monitoring data collected by water resource management groups across the country. Organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit data to the WQX.

We are interested in Assessment Unit “IL_I-84”. This is a Mississippi River Assessment Unit. Because we don’t have a comprehensive list of the Monitoring Locations in this Assessment Unit, we will need to create a broad query to retrieve all results that might be associated the IL_I-84. One way to do this is to include all HUC10s that are relevant to the Assessment Unit.

We can use TADA_DataRetrieval to retrieve WQP data. In this example, we have specified lists of HUCs and Characteristic Types to reduce the size of the query since we have a general idea of the location of our assessment unit of interest and know which characteristics we’d like to explore. One quick way to find this information is to view IL_I-84’s Waterbody Report on How’s My Waterway: WaterbodyReport/IL_EPA/IL_I-84/2022.

We’ll also need to set start and end dates. For this example we’ll use a ten year range between 2010 and 2020. We will also set the Characteristic Type to “Nutrient” and “Physical”, because these types contain nitrogen, pH and water temperature Characteristics. It is possible to query by a specific Characteristic Name, but different organizations may input the same name in a variety of ways. Therefore, we recommend keeping your initial queries as broad as possible so all relevant results are downloaded and useful data from synonymous Characteristics are not missed. There are TADA functions that can be used to harmonize these synonyms.

# Import data from WQP
data <- TADA_DataRetrieval(
  statecode = "IL",
  startDate = "2010-01-01",
  endDate = "2020-12-31",
  huc = c("0714010505", "0714010504", "0714010508", "0714010501", "0714010503"),
  applyautoclean = TRUE,
  ask = FALSE
)

AutoClean

As part of TADA_DataRetrieval, TADA_AutoClean is performed automatically unless specified otherwise by the user (by setting applyautoclean = FALSE). When TADA_AutoClean is run, the following functions are performed on the data retrieved from the WQP:

  • TADA_ConvertSpecialChars - converts result value columns to numeric and flags non-numeric values that could not be converted.

  • TADA_ConvertResultUnits - unifies result units for easier quality control and review

  • TADA_ConvertDepthUnits - converts depth units to a consistent unit (meters).

  • TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.

  • Other helpful actions - converts important text columns to all upper-case letters, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)

As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” This allows users to easily review any changes TADA functions have made.

Data Retrieval - Geospatial

All geospatial data used in this example are downloaded from the Assessment Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) (https://www.epa.gov/waterdata/attains). ATTAINS is an online system for accessing information about conditions in the Nation’s surface waters.

This information reported to EPA by states is available in ATTAINS. The public information is made available via ATTAINS web services. New TADA functions leverage the ATTAINS geospatial services to make geospatial information including catchment and assessment unit geometry easily accessible to R users.

We can use the function, TADA_GetATTAINS to obtain geospatial data from ATTAINS relevant to the Monitoring Locations included in the WQP data set. See TADA Module 2 for a much more detailed look at the logic behind this and the other TADA geospatial functions.

# Import data from ATTAINS geospatial services
ATTAINS_data <- TADA_GetATTAINS(data)

View Geospatial Features

The TADA_ViewATTAINS function allows us to see where the monitoring locations from the WQP data set are relative to ATTAINS-indexed catchments and assessment units. This can be helpful when deciding which Monitoring Locations should be retained for additional analysis. For this demo, we are focusing on a single Assessment Unit, IL-84, with multiple monitoring locations (MonitoringLocationIdentifier).

# View catchments and assessment units on map

ATTAINS_map <- TADA_ViewATTAINS(ATTAINS_data)

ATTAINS_map

Monitoring Location Filter and Review

Now that we’ve associated geospatial data from ATTAINS with the WQP data, we can filter the data set to retain only results that were collected in the specified Assessment Unit. We can also generate a new table to give us some information about the individual monitoring locations within the assessment unit. After this filter step is complete, we can remove the “ATTAINS.” prefixed columns to reduce the size of the data set, as we won’t need them for the remaining steps in this example.

We can also create a table with some basic information about the Monitoring Locations in Assessment Unit IL-84 and a pie chart to display the relative number of results contributed by each organization.

# Filter data for specified assessment unit

AUID_data <- ATTAINS_data$TADA_with_ATTAINS %>%
  dplyr::filter(ATTAINS.assessmentunitidentifier == "IL_I-84")

Analysis_data <- ATTAINS_data$TADA_with_ATTAINS %>%
  dplyr::filter(ATTAINS.assessmentunitidentifier == "IL_I-84") %>%
  dplyr::select(-contains("ATTAINS.")) %>%
  sf::st_drop_geometry() %>%
  TADA_RetainRequired()


# Create table of monitoring location identifiers

MonitoringLocations <- Analysis_data %>%
  dplyr::select(MonitoringLocationName, MonitoringLocationIdentifier, OrganizationFormalName) %>%
  dplyr::distinct()

DT::datatable(MonitoringLocations, fillContainer = TRUE)


# Create pie of results by organization
Orgs_pie <- TADA_FieldValuesPie(Analysis_data, field = "OrganizationFormalName")

Orgs_pie

This allows us to easily review how many monitoring locations, organizations, and results are in the dataframe for the selected time period.

Water Chemistry Data Preparation and QC

Before we can begin comparisons between the water chemistry results and a water quality standard, there are some additional data preparation and QC steps that can be performed using TADA functions.

First, we can check the data for additional conditions which might cause us to exclude certain results from further analysis. For example, we can use TADA_FlagMethod to check for invalid analytical method-characteristic combinations; TADA_FlagSpeciation to check for invalid characteristics-method speciation combinations, TADA_FlagResultUnit to check the the validity of each characteristic-media-result unit combination, and TADA_FlagFraction to check the validity of characteristic-fraction combinations. In this simple example, we will remove all invalid combinations identified by each of these functions. When setting up your own workflow, you will want to review each function and its output more carefully before moving on.

# Flag and remove results

Analysis_data <- Analysis_data %>%
  TADA_FlagMethod(clean = TRUE) %>%
  TADA_FlagSpeciation(clean = "both") %>%
  TADA_FlagResultUnit(clean = "both") %>%
  TADA_FlagFraction(clean = TRUE)

Another set of TADA flagging functions, TADA_FlagAboveThreshold and TADA_FlagBelowThreshold, can be used to check results against national lower and upper thresholds. As in the previous examples, we will set these functions to remove any results that fall outside the national thresholds.

# Flag and remove results

Analysis_data <- Analysis_data %>%
  TADA_FlagAboveThreshold(clean = TRUE) %>%
  TADA_FlagBelowThreshold(clean = TRUE)

We can also review the data set to see if it contains potential duplicate results from within a single organization or potential duplicates from within multiple organizations (such as when two or more organizations monitor the same location and may submit duplicate results).

We will select to keep only unique samples from TADA_FindPotentialDuplicatesSingleOrg by filtering for TADA.SingleOrgDup.Flag equals “Unique”. Then we will search for potential duplicates from multiple orgs with TADA_FindPotentialDuplicatesMultipleOrgs and filter only to retain only one value per potential duplicate by filtering for TADA.ResultSelectedMultipleOrgs equals “Y”.

If you would like to prioritize results from one organization over another, this can be done using the org_hierarchy argument in TADA_FindPotentialDuplicatesMultipleOrgs. Information on how to this can be found in the function reference for TADA_FindPotentialDuplicatesMultipleOrgs. In this example, we will use the default setting “none” where a single representative is selected randomly from each duplicate group.

# Flag and remove results

Analysis_data <- Analysis_data %>%
  TADA_FindPotentialDuplicatesSingleOrg() %>%
  dplyr::filter(TADA.SingleOrgDup.Flag == "Unique") %>%
  TADA_FindPotentialDuplicatesMultipleOrgs(
    dist_buffer = 100,
    org_hierarchy = "none"
  ) %>%
  dplyr::filter(TADA.ResultSelectedMultipleOrgs == "Y")

Next we can remove QC and qualified samples (with qualifiers identified as suspect) using TADA_FindQCActivities and TADA_FlagMeasureQualifierCode, respectively. See the TADA Module 1 vignettes for more information on how these functions identify and flag results for removal as you may wish to make different decisions than the TADA defaults regarding which results to retain.

# Flag and remove results

Analysis_data <- Analysis_data %>%
  TADA_FindQCActivities(clean = TRUE) %>%
  TADA_FlagMeasureQualifierCode(clean = TRUE)

Censored Data

Censored data are measurements for which the true value is not known, but we can estimate the value based on lower or upper detection conditions and limit types. TADA fills missing TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode values with values and units from TADA.DetectionQuantitationLimitMeasure.MeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode, respectively, using the TADA_AutoClean function.

The TADA package currently has functions that summarize censored data incidence in the dataset and perform simple substitutions of censored data values, including x times the detection limit and random selection of a value between 0 and the detection limit. The user may specify the methods used for non-detects and over-detects separately in the input to the TADA_SimpleCensoredMethods function. The next step we take in this example is to perform simple conversions to the censored data in the dataset: we keep over-detects as is (no conversion made) and convert non-detect values to 0.5 times the detection limit (half the detection limit).

See the TADA Module 1 vignettes for more detailed information on the logic behind TADA_SimpleCensoredMethods.

After running TADA_SimpleCensoredMethods, we will also filter the data to remove any remaining NAs.

Analysis_data <- TADA_SimpleCensoredMethods(Analysis_data,
  nd_method = "multiplier",
  nd_multiplier = 0.5,
  od_method = "as-is",
  od_multiplier = "null"
) %>%
  dplyr::filter(!is.na(TADA.ResultMeasureValue))

Harmonize Results

The TADA_GetSynonymRef function generates a synonym reference table that is specific to the input dataframe. Users can review how their input data relates to standard TADA values for the following elements: TADA.CharacteristicName, TADA.ResultSampleFractionText, TADA.MethodSpeciationName, and TADA.ResultMeasure.MeasureUnitCode and edit if desired.

Users can also edit the reference file to meet their needs if desired. The download argument can be used to save the harmonization file to your current working directory when download = TRUE, the default is download = FALSE. In this example, we will modify the harmonization file to group results with the fraction “NA” and results with the fraction “TOTAL” for pH and water temperature.

The TADA_HarmonizeSynonyms function then compares the input dataframe to the TADA Synonym Reference Table and makes conversions where target characteristics/fractions/speciations/units are provided. This function also appends a column called TADA.Harmonized.Flag, indicating which results had metadata changed/converted in this function. The purpose of this function is to make similar data consistent and therefore easier to compare and analyze.

UniqueHarmonizationRef <- TADA_GetSynonymRef(Analysis_data)

UniqueHarmonizationRef_edit <- UniqueHarmonizationRef %>%
  dplyr::mutate(
    Target.TADA.CharacteristicName = ifelse(TADA.CharacteristicName == "TEMPERATURE, WATER", "TEMPERATURE", TADA.CharacteristicName),
    Target.TADA.ResultSampleFractionText = ifelse(TADA.CharacteristicName == "PH", NA, Target.TADA.ResultSampleFractionText),
    HarmonizationGroup = ifelse(TADA.CharacteristicName == "PH", "PH", HarmonizationGroup),
    HarmonizationGroup = ifelse(Target.TADA.CharacteristicName == "TEMPERATURE", "TEMPERATURE",
      HarmonizationGroup
    )
  )

Harmonized_data <- TADA_HarmonizeSynonyms(Analysis_data,
  ref = UniqueHarmonizationRef_edit
)

Total Nitrogen and Total Phosphorus Calculations

TADA_CalculateTotalNP uses the Nutrient Aggregation logic to add together specific subspecies to obtain a total. TADA adds one more equation to the mix: total particulate nitrogen + total dissolved nitrogen. The function uses as many subspecies as possible to calculate a total for each given site, date, and depth group, but it will estimate total nitrogen with whatever subspecies are present. This function creates NEW total nutrient measurements (total nitrogen unfiltered as N and total phosphorus unfiltered as P) and adds them to the dataframe.

Users can use the default summation worksheet (see TADA_GetNutrientSummationRef) or customize it to suit their needs. The function also requires a daily aggregation value, either minimum, maximum, or mean. The default is ‘max’, which means that if multiple measurements of the same subspecies-fraction-speciation-unit occur on the same day at the same site and depth, the function will pick the maximum value to use in summation calculations. In this example, we will use the default summation worksheet to obtain total nitrogen.

Harmonized_data <- TADA_CalculateTotalNP(Harmonized_data, daily_agg = "max")

Parameter Level Filtering

Now we can filter to our three parameters of interest and use TADA_Stats take a look at some basic statistics for each, including location count, measurement count, min, max, and more.

# review unique identifiers
unique(Harmonized_data$TADA.ComparableDataIdentifier)

# filter for three comparable data identifiers of interest
Filtered_data <- Harmonized_data %>%
  dplyr::filter(TADA.ComparableDataIdentifier %in% c("TEMPERATURE_NA_NA_DEG C", "PH_NA_NA_STD UNITS", "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"))

# generate stats table
Filtered_data_stats <- TADA_Stats(Filtered_data)

DT::datatable(Filtered_data_stats, fillContainer = TRUE)

Standard Comparisons With Scatterplot and Pie Chart

Next, we will create a two characteristic scatterplot of pH and temperature to begin visualizing the data, using TADA_TwoCharacteristicScatterplot.

# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(Harmonized_data, id_cols = "TADA.ComparableDataIdentifier", groups = c("TEMPERATURE_NA_NA_DEG C", "PH_NA_NA_STD UNITS"))

Let’s focus on just one characteristic, pH. We can filter the dataframe to retain only pH samples. We can also add a column to indicate whether each result falls within the water quality standard range for pH (6.5 - 9). From this pH-only frame we can create a table with just a few columns for easier review, and a single characteristic scatterplot using TADA_Scatterplot. In this example, horizontal lines have been added to indicate the upper and lower ends of the water quality standard. These are not a default option in TADA_Scatterplot but can be added via the plotly package.

# comparison to standard for ph
pH_Standard <- Filtered_data %>%
  dplyr::filter(TADA.ComparableDataIdentifier == "PH_NA_NA_STD UNITS") %>%
  dplyr::mutate(MeetsStandard = ifelse(TADA.ResultMeasureValue >= 6.5 & TADA.ResultMeasureValue <= 9, "Yes", "No"))


# subset of pH data with fewer rows
pH_Table <- pH_Standard %>%
  dplyr::select(
    MonitoringLocationIdentifier, OrganizationFormalName, ActivityStartDate, TADA.ResultMeasureValue,
    MeetsStandard
  )

DT::datatable(pH_Table, fillContainer = TRUE)

# pH scatterplot
pH_Scatter <- TADA_Scatterplot(pH_Standard, id_cols = "TADA.ComparableDataIdentifier") %>%
  plotly::add_lines(
    y = 6.5,
    x = c(min(pH_Standard$ActivityStartDate), max(pH_Standard$ActivityStartDate)),
    inherit = FALSE,
    showlegend = FALSE,
    line = list(color = "red"),
    hoverinfo = "none"
  ) %>%
  plotly::add_lines(
    y = 9,
    x = c(min(pH_Standard$ActivityStartDate), max(pH_Standard$ActivityStartDate)),
    inherit = FALSE,
    showlegend = FALSE,
    line = list(color = "red"),
    hoverinfo = "none"
  )

pH_Scatter

Comparison of temperature to water quality standards is a little more complicated as the maximum temperature varies by season and there are multiple components of the temperature standard. For this example, we will focus on the “shall never exceed” seasonal temperature standards.

These are:

17.7 deg C for January, February, March, and December

33.7 deg C for April, May, June, July, August, September, October, and November.

We can use lubridate and dplyr functions to assign the appropriate standard to each result by using lubridate::month to identify the month in which each sample was collected and dplyr::mutate to create a new column for the temperature standard. Then the results can be compared to the standard.

We can use TADA_Scatterplot again to visualize the data. Due to the seasonal variation in the standard and the ten-year date range in our data set, we may want to consider other visualizations to visually review the number of results not meeting the standard. We can use TADA_FieldValuesPie on the MeetsStandard field we created to see how many samples met or did not meet the standard.

Creating a table also allows for easy filtering of results to identify the results that did not meet the standard and review during which years and seasons they occurred.

# comparison to standard for temperature
Temp_Standard <- Filtered_data %>%
  dplyr::filter(TADA.ComparableDataIdentifier == "TEMPERATURE_NA_NA_DEG C") %>%
  dplyr::mutate(
    MonthForAnalysis = lubridate::month(ActivityStartDate),
    TempStandard = ifelse(MonthForAnalysis %in% c(1, 2, 3, 12), 17.7, 33.7),
    MeetsStandard = ifelse(TADA.ResultMeasureValue < TempStandard,
      "Yes", "No"
    )
  )

# create scatterplot for temperature
Temp_Scatter <- TADA_Scatterplot(Temp_Standard, id_cols = "TADA.ComparableDataIdentifier")

Temp_Scatter

# create pie chart for temperature

Temp_Pie <- TADA_FieldValuesPie(Temp_Standard, field = "MeetsStandard")

Temp_Pie

# create subset table for temperature
Temp_Table <- Temp_Standard %>%
  dplyr::select(
    MonitoringLocationIdentifier, OrganizationFormalName, ActivityStartDate, TADA.ResultMeasureValue, TempStandard,
    MeetsStandard
  )

DT::datatable(Temp_Table, fillContainer = TRUE)

Additional Data Exploration

We may also want to use TADA functions for exploration and visualization of characteristics without water quality standards. For this example, we will use “TOTAL NITROGEN, MIXED FORMS_FILTERED, LAB_NA_MG/L”.

To better understand the distribution of results, we can use the TADA functions TADA_Histogram and TADA_Boxplot.

TADA_Histogram can be useful for identifying the overall shape of the data.

# filter dataframe to comparable data identifier of interest

Nitrogen_data <- dplyr::filter(Filtered_data, TADA.ComparableDataIdentifier == "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L")

# generate a histogram
Nitrogen_Histogram <- TADA_Histogram(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")

# view histogram
Nitrogen_Histogram

TADA_Boxplot can be useful for identifying skewness and percentiles.

Nitrogen_Boxplot <- TADA_Boxplot(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")


Nitrogen_Boxplot

Filtering the results of TADA_Stats for only “TOTAL NITROGEN, MIXED FORMS_TOTAL RECOVERABLE_NA_MG/L” and selecting a smaller subset of columns or creating a single characteristic scatterplot may also provide useful information.

# create table with nitrogen stats
Nitrogen_stats <- Filtered_data_stats %>%
  dplyr::filter(TADA.ComparableDataIdentifier == "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L") %>%
  dplyr::select(Location_Count, Measurement_Count, Min, Max, Mean)

DT::datatable(Nitrogen_stats, fillContainer = TRUE)

# create nitrogen scatterplot
Nitrogen_Scatterplot <- TADA_Scatterplot(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")

Nitrogen_Scatterplot

Reproducible and Documented

Major benefits to this type are workflow are that it is reproducible and the decisions at each step are well documented. This means that it is easy to go back and review every step, understand the decisions that were made, make changes as necessary, and run it again. If someone asks “How were data filtered prior to analysis?” or “How did you identify additional Monitoring Locations within the Assessment Unit?”, you can refer to the code and function documentation to provide answers.

For example, we could change the Assessment Unit of interest, modify the relevant code chunks (potentially including the WQP query if the new Assessment Unit is in a different HUC or state) and repeat the same analysis for the same characteristics at a different location. We could also modify the harmonization reference to group additional comparable data (if appropriate) or add a step to run TADA_ConvertResultUnits an additional time outside of TADA_AutoClean if we wanted to convert temperature results to deg F for analysis.

Or we could change or add additional characteristics to our analysis. We could even write additional custom functions or even use functions from other packages as needed for more complex standards, evaluate trends, or answer other questions using a TADA dataframe.

The code chunk below displays the elapsed time it took to run the code chunks and create this document, providing an example of how incorporating TADA in your workflow can increase efficiency.

end.time <- Sys.time()

end.time - start.time