
Overview

Let’s walk through how to create an efficient and reproducible workflow that integrates several R packages developed by the U.S. Environmental Protection Agency (EPA) and the U.S. Geological Survey (USGS) to support water quality programs (such as those under the Clean Water Act (CWA)) and geospatial (watershed- or waterbody-level) analyses.

This workflow demonstrates potential uses (beyond their original collection purpose) for publicly available water quality data from the Water Quality Portal (WQP). To start, participants will learn how to use EPA’s Tools for Automated Data Analysis (TADA) R package to retrieve, wrangle, harmonize, quality check, visualize, and analyze WQP data from multiple organizations.

Next, we will showcase how to bring in other web services and libraries for easy integration of additional hydrologic and geospatial data. We then plan to touch briefly on packages that can assist with building statistical models. Finally, we will demonstrate an example of analyzing water quality by Assessment Units (AUs), which are state- or tribal nation-defined watershed or waterbody areas used for CWA assessments and for reporting water quality conditions to EPA and the public.

Intended Audience

Water Quality eXchange (WQX) and Water Quality Portal (WQP) community, Clean Water Act (CWA) community (EPA, States and Tribal Nations), water quality and geospatial data analysts/researchers, EPA/USGS and other federal agencies.

Leveraged R Packages

  • EPA: EPATADA, StreamCatTools, spsurvey, spmodel, SSN2

  • USGS: dataRetrieval, nhdplusTools, hydroloom

  • Fundamental geospatial packages: sf, prism, terra, leaflet and tmap

Install and load packages

We will be leveraging the EPATADA R Package for WQP data retrieval, cleaning, visualization and other steps needed to prepare for analysis. Let’s dive into Green Bay, WI!

First, install and load the remotes package, specifying the repository. This is needed before installing EPATADA because EPATADA is only available on GitHub (not CRAN).

install.packages("remotes",
  repos = "http://cran.us.r-project.org"
)
library(remotes)

Next, install and load EPATADA using the remotes package. All other dependencies for this workflow will be installed automatically. We will also need to install StreamCatTools for some steps later in the workflow.

remotes::install_github("USEPA/EPATADA", 
                        ref = "develop",
                        dependencies = TRUE,
                        force = TRUE)

remotes::install_github("USEPA/StreamCatTools", 
                        ref = "master",
                        dependencies = TRUE,
                        force = TRUE)

library(EPATADA)
library(StreamCatTools)

It’s go time! Let’s time our process.

# Record start time
start.time <- Sys.time()

WQP data discovery and cleaning

This is an abbreviated introduction to key TADA Module 1 (WQP Data Discovery and Cleaning) functions. Additional functions and a more detailed example workflow are available in the TADA Module 1 vignette on the EPATADA website.

Retrieve data from the WQP

In this example, we will first use EPA’s How’s My Waterway (HMW) application to find an applicable Hydrologic Unit Code (HUC) for our area of interest: the Fox River in Green Bay, WI. Next, let’s query the WQP using the identified HUC, state abbreviation, and a date range. We’ll start by pulling all data available in the WQP for this HUC12 in Wisconsin from 2015 through 2024.

WATERSHED: City of Green Bay-Fox River (040302040405)

# Query the WQP
GreenBay_FoxRiver <- TADA_DataRetrieval(
  statecode = "WI",
  startDate = "2015-01-01",
  endDate = "2024-12-30",
  huc = c("040302040405"),
  applyautoclean = TRUE,
  ask = FALSE
)
## [1] "Downloading WQP query results. This may take some time depending upon the query size."
## $statecode
## [1] "US:55"
## 
## $huc
## [1] "040302040405"
## 
## $startDate
## [1] "2015-01-01"
## 
## $endDate
## [1] "2024-12-30"
## 
## [1] "Data successfully downloaded. Running TADA_AutoClean function."
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "200 results in your dataset have one of the following deprecated characteristic names: 1-Octanesulfonic acid, 3,3,4,4,5,5,6,6,7,7,8,8,8-tridecafluoro-; 1,2-Benzenedicarboxamide, N2-[1,1-dimethyl-2-(methylsulfonyl)ethyl]-3-iodo-N1-[2-methyl-4-[1,2,2,2-tetrafluoro-1-(trifluoromethyl)ethyl]phenyl]-; Decabromodiphenyl ether; Desisopropyl atrazine; Inorganic nitrogen (nitrate and nitrite); Inorganic nitrogen (nitrate and nitrite) ***retired***use Nitrate + Nitrite; N-ethyl Perfluorooctane sulfonamide. These names have been substituted with the updated preferred names in the TADA.CharacteristicName field."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"
# Alternatively, load a cached copy of the query results:
# GreenBay_FoxRiver <- NMCWorkshopData::GreenBay_FoxRiver
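
The TADA_AutoClean note above points out that non-water media types are not filtered automatically. If your analysis should only include water media, you can apply the exact filter the note suggests (left commented out here so the counts below match the workshop run):

# Optional: keep only results collected in water media, per the note above
# GreenBay_FoxRiver <- dplyr::filter(GreenBay_FoxRiver, TADA.ActivityMediaName == "WATER")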

Wrangle

Now, let’s use EPATADA functions to review, visualize, and whittle the returned WQP data down to include only results that are applicable to our water quality analysis and area of interest.

Autoclean

In the default arguments for TADA_DataRetrieval, applyautoclean = TRUE. This runs TADA_AutoClean on the newly retrieved data frame. TADA_AutoClean is a powerful function that performs a variety of tasks: (1) creates new “TADA”-prefixed columns and capitalizes their contents to reduce case-sensitivity issues, (2) converts special characters in value columns, (3) converts latitude and longitude values to numeric, (4) replaces “meters” with “m”, (5) replaces deprecated characteristic names with current WQX names, (6) harmonizes result and detection limit units to WQX, TADA, or user-supplied target units, (7) converts depths to meters, and (8) creates the TADA.ComparableDataIdentifier column by concatenating characteristic name, result sample fraction, method speciation, and result measure unit.
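
If you set applyautoclean = FALSE during retrieval, the same cleaning can be run as a standalone step afterwards (shown for reference only; it is not needed here because it already ran):

# Not run: autoclean already ran during TADA_DataRetrieval above
# GreenBay_FoxRiver <- TADA_AutoClean(GreenBay_FoxRiver)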

Flag and remove duplicate results from a single organization

We can identify data records uploaded by the same organization with the same date, time, monitoring location, activity type, characteristic name, fraction, taxonomic name, depth columns, and result value, and flag them as potential duplicates. The data user must determine whether these records are truly unique or represent overlap that could cause issues in analysis. For this example, we will retain only results flagged as “Unique”.

# find duplicate results submitted by single org
GreenBay_FoxRiver <- TADA_FindPotentialDuplicatesSingleOrg(GreenBay_FoxRiver)
## [1] "TADA_FindPotentialDuplicatesSingleOrg: 117 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
# retain unique flagged results
GreenBay_FoxRiver <- dplyr::filter(GreenBay_FoxRiver, TADA.SingleOrgDup.Flag == "Unique")

Censored results

TADA provides some simple methods for dealing with censored results, such as multiplying the detection limit by a user-supplied value or leaving the result as is.

# substitute non-detects with 0.5 x detection limit, leave over-detects as is
GreenBay_FoxRiver <- TADA_SimpleCensoredMethods(GreenBay_FoxRiver,
  nd_method = "multiplier", nd_multiplier = 0.5,
  od_method = "as-is", od_multiplier = "null"
)
## [1] "Dataframe does not include any information (all NA's) in MeasureQualifierCode."
## [1] "TADA_IDCensoredData: There are 9 results in your dataframe that are missing ResultDetectionConditionText. TADA requires BOTH ResultDetectionConditionText and DetectionQuantitationLimitTypeName fields to be populated in order to categorize censored data."

Flag and remove duplicates from multiple organizations

Two organizations sometimes submit the exact same data to the WQP. Filtering out these duplicates can prevent issues in analysis.

# find potential dups multiple orgs
GreenBay_FoxRiver <- TADA_FindPotentialDuplicatesMultipleOrgs(GreenBay_FoxRiver)

# filter out duplicates, keeping one result per group
GreenBay_FoxRiver <- dplyr::filter(GreenBay_FoxRiver, TADA.ResultSelectedMultipleOrgs == "Y")

Filter out any remaining irrelevant data, NAs, and empty columns

unique(GreenBay_FoxRiver$TADA.ResultMeasureValueDataTypes.Flag)
## [1] "Numeric"                                         
## [2] "Result Value/Unit Estimated from Detection Limit"
## [3] "Result Value/Unit Copied from Detection Limit"   
## [4] "Greater Than"                                    
## [5] "Text"                                            
## [6] "Coerced to NA"
sum(is.na(GreenBay_FoxRiver$TADA.ResultMeasureValue))
## [1] 63
GreenBay_FoxRiver <- TADA_AutoFilter(GreenBay_FoxRiver)
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
## [1] "TADA_Autofilter: removing columns not required for TADA workflow if they contain only NAs."
## [1] "The following column(s) were removed as they contained only NAs: ActivityDepthAltitudeReferencePointText, SampleAquifer, BinaryObjectFileName, BinaryObjectFileTypeCode, LabSamplePreparationUrl, FormationTypeText, ProjectMonitoringLocationWeightingUrl, ContributingDrainageAreaMeasure.MeasureValue and ContributingDrainageAreaMeasure.MeasureUnitCode."
## [1] "TADA_Autofilter: checking required columns for non-NA values."
## [1] "TADA Required column(s) ResultDepthAltitudeReferencePointText, ActivityRelativeDepthName, ActivityTopDepthHeightMeasure.MeasureValue, TADA.ActivityTopDepthHeightMeasure.MeasureValue, ActivityTopDepthHeightMeasure.MeasureUnitCode, TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode, ActivityBottomDepthHeightMeasure.MeasureValue, TADA.ActivityBottomDepthHeightMeasure.MeasureValue, ActivityBottomDepthHeightMeasure.MeasureUnitCode, TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode, ResultTimeBasisText, ResultFileUrl, ResultAnalyticalMethod.MethodUrl, SampleCollectionMethod.MethodDescriptionText, MeasureQualifierCode, DataQuality.PrecisionValue, DataQuality.BiasValue, DataQuality.ConfidenceIntervalValue, DataQuality.UpperConfidenceLimitValue, DataQuality.LowerConfidenceLimitValue, SamplingDesignTypeCode, ProjectFileUrl, QAPPApprovedIndicator, QAPPApprovalAgencyName, AquiferName, AquiferTypeName, LocalAqfrName, ConstructionDateText, WellDepthMeasure.MeasureValue, WellDepthMeasure.MeasureUnitCode, WellHoleDepthMeasure.MeasureValue and WellHoleDepthMeasure.MeasureUnitCode contain only NA values. This may impact other TADA functions."
## [1] "Function removed 65 results. These results are either text or NA and cannot be plotted or represent quality control activities (not routine samples or measurements)."

Check to make sure there are no more NAs in TADA.ResultMeasureValue.

unique(GreenBay_FoxRiver$TADA.ResultMeasureValueDataTypes.Flag)
## [1] "Numeric"                                         
## [2] "Result Value/Unit Estimated from Detection Limit"
## [3] "Result Value/Unit Copied from Detection Limit"   
## [4] "Greater Than"
sum(is.na(GreenBay_FoxRiver$TADA.ResultMeasureValue))
## [1] 0

Flag and remove QAQC samples and suspect results

GreenBay_FoxRiver <- TADA_RunKeyFlagFunctions(GreenBay_FoxRiver, clean = TRUE)
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
## [1] "Dataframe does not include any information (all NA's) in MeasureQualifierCode."

Flag results above and below threshold, but do not remove them

GreenBay_FoxRiver <- TADA_FlagAboveThreshold(GreenBay_FoxRiver, clean = FALSE, flaggedonly = FALSE)

GreenBay_FoxRiver <- TADA_FlagBelowThreshold(GreenBay_FoxRiver, clean = FALSE, flaggedonly = FALSE)

Harmonize synonyms across characteristic, fraction, and speciation

GreenBay_FoxRiver <- TADA_HarmonizeSynonyms(GreenBay_FoxRiver)

Calculate Total N and Total P from various species and fractions

GreenBay_FoxRiver <- TADA_CalculateTotalNP(GreenBay_FoxRiver, daily_agg = "max")
## [1] "Aggregation results:"
## 
##           No aggregation needed Selected as max aggregate value 
##                           11204                             874

Review unique characteristic, fraction, and species combinations

GreenBay_FoxRiver_Counts <- TADA_FieldValuesTable(GreenBay_FoxRiver, field = "TADA.ComparableDataIdentifier")

DT::datatable(GreenBay_FoxRiver_Counts, fillContainer = TRUE)

Filter to focus on frequently monitored characteristics in example data

GreenBay_FoxRiver_Subset <- GreenBay_FoxRiver %>%
  dplyr::filter(TADA.ComparableDataIdentifier %in%
    c(
      "SPECIFIC CONDUCTANCE_NA_NA_US/CM",
      "PH_NA_NA_NA",
      "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L",
      "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L",
      "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"
    ))

Review organizations for subset

# Create pie of results by organization
TADA_FieldValuesPie(GreenBay_FoxRiver_Subset, field = "OrganizationFormalName")

Exploratory visualizations

Generate stats table

GreenBay_FoxRiver_Subset_Stats <- TADA_Stats(GreenBay_FoxRiver_Subset)
## [1] "Note: Your dataset contains TADA-generated total nutrient results, which have fewer columns populated with metadata. This might affect how groups are displayed in the stats table."
DT::datatable(GreenBay_FoxRiver_Subset_Stats, fillContainer = TRUE)

Generate scatterplot

TADA_TwoCharacteristicScatterplot(GreenBay_FoxRiver_Subset,
  id_cols = "TADA.ComparableDataIdentifier",
  groups = c(
    "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L",
    "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"
  ))

Generate map

TADA_OverviewMap(GreenBay_FoxRiver_Subset)

Clean up coordinate issues

# Change coordinate sign if appropriate
GreenBay_FoxRiver_Subset <- TADA_FlagCoordinates(GreenBay_FoxRiver_Subset, clean_outsideUSA = "change sign", clean_imprecise = FALSE)
## [1] "When clean_outsideUSA == change sign, the sign for any lat/long coordinates flagged as outside of USA are switched. This is a temporary solution. Data owners should fix the raw data to address Suspect coordinates through WQX. For assistance fixing data errors you see in the WQP, email the WQX helpdesk (WQX@epa.gov)."
# This data frame has NA longitudes (from USGS records) that must be addressed before TADA_MakeSpatial can be run
sum(is.na(GreenBay_FoxRiver_Subset$LongitudeMeasure))
## [1] 601
# Remove rows with NA lons from df
GreenBay_FoxRiver_Subset <- GreenBay_FoxRiver_Subset[!is.na(GreenBay_FoxRiver_Subset$LongitudeMeasure),]

# Recheck
sum(is.na(GreenBay_FoxRiver_Subset$LongitudeMeasure))
## [1] 0

Geospatial data integration

Make spatial

Leverage TADA_MakeSpatial to transform a WQP dataframe into a geospatial sf object.

GreenBay_FoxRiver_sf = TADA_MakeSpatial(GreenBay_FoxRiver_Subset)

Then create a unique identifier based on shared latitude/longitude values and filter to the unique monitoring locations (27 in this example).

GreenBay_FoxRiver_sf$latlon <- paste0(GreenBay_FoxRiver_sf$TADA.LongitudeMeasure, GreenBay_FoxRiver_sf$TADA.LatitudeMeasure)

GreenBay_FoxRiver_sf <- GreenBay_FoxRiver_sf |> 
  dplyr::group_by(latlon) |> 
  dplyr::mutate(loc_id = dplyr::cur_group_id())

GreenBay_FoxRiver_sf_locs <- GreenBay_FoxRiver_sf |>
  dplyr::filter(!duplicated(loc_id))

Access NHDPlus COMIDs for sites

We use the StreamCatTools function sc_get_comid (which uses an nhdplusTools web-service client) to get the COMID for each location.

GreenBay_FoxRiver_sf_locs$COMID <- as.integer(strsplit(StreamCatTools::sc_get_comid(GreenBay_FoxRiver_sf_locs), split = ",")[[1]])

nhdplus_data <- nhdplusTools::subset_nhdplus(GreenBay_FoxRiver_sf_locs$COMID, nhdplus_data = "download")

outlet <- dplyr::filter(nhdplus_data$NHDFlowline_Network, hydroseq == min(hydroseq)) 

nhdplusTools::plot_nhdplus(bbox = sf::st_bbox(outlet))
plot(sf::st_transform(sf::st_geometry(GreenBay_FoxRiver_sf_locs), 3857), add = TRUE)

dataRetrieval/NLDI, nhdplusTools, hydroloom

Do a network navigation and get NHDPlus data for our sites. Note that the network navigation returns only flowline geometry; nhdplusTools::subset_nhdplus retrieves all of the NHDPlus layers.

all_network <- dataRetrieval::findNLDI(comid = outlet$comid, nav = "UT", distance_km = 500)

# we could select only comids on the network (not run here)
if (FALSE) {
  nhdplus_data <- nhdplusTools::subset_nhdplus(
    comids = as.integer(all_network$UT_flowlines$nhdplus_comid),
    nhdplus_data = "download",
    flowline_only = FALSE
  )
}

# or we could just get everything in the bbox to be sure we get non-network stuff too!
nhdplus_data <- nhdplusTools::subset_nhdplus(
  bbox = sf::st_bbox(all_network$UT_flowlines), 
  nhdplus_data = "download", 
  flowline_only = FALSE)

# see ?nhdplusTools::subset_nhdplus for lots more options!

sapply(nhdplus_data, nrow)
##            CatchmentSP    NHDFlowline_Network                NHDArea 
##                  10522                  10599                    104 
##           NHDWaterbody NHDFlowline_NonNetwork 
##                   3988                    410
sapply(nhdplus_data, names)
## $CatchmentSP
## [1] "gridcode"     "featureid"    "sourcefc"     "areasqkm"     "shape_length"
## [6] "shape_area"   "geometry"    
## 
## $NHDFlowline_Network
##   [1] "comid"        "fdate"        "resolution"   "gnis_id"      "gnis_name"   
##   [6] "lengthkm"     "reachcode"    "flowdir"      "wbareacomi"   "ftype"       
##  [11] "fcode"        "shape_length" "streamleve"   "streamorde"   "streamcalc"  
##  [16] "fromnode"     "tonode"       "hydroseq"     "levelpathi"   "pathlength"  
##  [21] "terminalpa"   "arbolatesu"   "divergence"   "startflag"    "terminalfl"  
##  [26] "dnlevel"      "uplevelpat"   "uphydroseq"   "dnlevelpat"   "dnminorhyd"  
##  [31] "dndraincou"   "dnhydroseq"   "frommeas"     "tomeas"       "rtndiv"      
##  [36] "vpuin"        "vpuout"       "areasqkm"     "totdasqkm"    "divdasqkm"   
##  [41] "tidal"        "totma"        "wbareatype"   "pathtimema"   "hwnodesqkm"  
##  [46] "maxelevraw"   "minelevraw"   "maxelevsmo"   "minelevsmo"   "slope"       
##  [51] "elevfixed"    "hwtype"       "slopelenkm"   "qa_ma"        "va_ma"       
##  [56] "qc_ma"        "vc_ma"        "qe_ma"        "ve_ma"        "qa_01"       
##  [61] "va_01"        "qc_01"        "vc_01"        "qe_01"        "ve_01"       
##  [66] "qa_02"        "va_02"        "qc_02"        "vc_02"        "qe_02"       
##  [71] "ve_02"        "qa_03"        "va_03"        "qc_03"        "vc_03"       
##  [76] "qe_03"        "ve_03"        "qa_04"        "va_04"        "qc_04"       
##  [81] "vc_04"        "qe_04"        "ve_04"        "qa_05"        "va_05"       
##  [86] "qc_05"        "vc_05"        "qe_05"        "ve_05"        "qa_06"       
##  [91] "va_06"        "qc_06"        "vc_06"        "qe_06"        "ve_06"       
##  [96] "qa_07"        "va_07"        "qc_07"        "vc_07"        "qe_07"       
## [101] "ve_07"        "qa_08"        "va_08"        "qc_08"        "vc_08"       
## [106] "qe_08"        "ve_08"        "qa_09"        "va_09"        "qc_09"       
## [111] "vc_09"        "qe_09"        "ve_09"        "qa_10"        "va_10"       
## [116] "qc_10"        "vc_10"        "qe_10"        "ve_10"        "qa_11"       
## [121] "va_11"        "qc_11"        "vc_11"        "qe_11"        "ve_11"       
## [126] "qa_12"        "va_12"        "qc_12"        "vc_12"        "qe_12"       
## [131] "ve_12"        "lakefract"    "surfarea"     "rareahload"   "rpuid"       
## [136] "vpuid"        "enabled"      "geometry"    
## 
## $NHDArea
##  [1] "comid"        "fdate"        "resolution"   "gnis_id"      "gnis_name"   
##  [6] "areasqkm"     "elevation"    "ftype"        "fcode"        "shape_length"
## [11] "shape_area"   "onoffnet"     "purpcode"     "purpdesc"     "geometry"    
## 
## $NHDWaterbody
##  [1] "comid"        "fdate"        "resolution"   "gnis_id"      "gnis_name"   
##  [6] "areasqkm"     "elevation"    "reachcode"    "ftype"        "fcode"       
## [11] "shape_length" "shape_area"   "onoffnet"     "purpcode"     "purpdesc"    
## [16] "meandepth"    "lakevolume"   "maxdepth"     "meandused"    "meandcode"   
## [21] "lakearea"     "geometry"    
## 
## $NHDFlowline_NonNetwork
##  [1] "comid"        "fdate"        "resolution"   "gnis_id"      "gnis_name"   
##  [6] "lengthkm"     "reachcode"    "flowdir"      "wbareacomi"   "ftype"       
## [11] "fcode"        "shape_length" "geometry"

Addressing sites to the network

There are two forms of hydrographic addresses: catchment indexing and linear referencing. The former is established with a point-in-polygon analysis; the latter is more nuanced. The following block shows how to establish both with the data we just retrieved.

Note that hydroloom is compatible with NHDPlus and other attribute systems. See the hydroloom documentation for more!

GreenBay_FoxRiver_sf_locs <- sf::st_join(
  GreenBay_FoxRiver_sf_locs, 
  hydroloom::st_compatibalize(dplyr::select(nhdplus_data$CatchmentSP, featureid),
                              GreenBay_FoxRiver_sf_locs))

# NOTE that featureid and comid are the same!!
all(GreenBay_FoxRiver_sf_locs$COMID == GreenBay_FoxRiver_sf_locs$featureid)
## [1] TRUE
(linear_references <- hydroloom::index_points_to_lines(
  nhdplus_data$NHDFlowline_Network,
  GreenBay_FoxRiver_sf_locs))
##    point_id    comid      reachcode reachcode_measure       offset
## 1         1 12005077 04030204000430           61.0183 7.977389e-04
## 2         2 12006773 04030204000019           20.5596 1.177518e-03
## 3         3  6802196 04030204002264           93.7429 2.599949e-03
## 4         4 12005077 04030204000430           61.0183 4.828450e-03
## 5         5 12005073 04030204000060            0.0000 1.837381e-03
## 6         6 12005075 04030204000002           34.8899 7.484437e-04
## 7         7 12006773 04030204000019           18.5164 2.579015e-03
## 8         8 12006797 04030204000036           97.5760 5.627477e-04
## 9         9  6802194 04030204002264           91.7988 4.304105e-03
## 10       10 12005077 04030204000430           61.0183 3.097056e-03
## 11       11 12006773 04030204000019           15.2004 3.426734e-03
## 12       12 12005075 04030204000002            1.2704 4.822365e-04
## 13       13 12005075 04030204000002           22.5444 2.719191e-04
## 14       14 12006757 04030204000002           59.0151 8.358642e-04
## 15       15 12006759 04030204000003            0.0000 9.384823e-04
## 16       16 12006765 04030204000010           67.6320 2.194561e-04
## 17       17 12006773 04030204000019            3.6289 1.029342e-03
## 18       18 12006773 04030204000019            8.0769 2.663165e-03
## 19       19 12005077 04030204000430           61.0183 1.690398e-03
## 20       20 12005073 04030204000060            8.4000 7.281552e-04
## 21       21 12006773 04030204000019           15.2004 3.002152e-03
## 22       22 12006757 04030204000002           85.9012 9.870182e-04
## 23       23 12005077 04030204000430           61.0183 4.628034e-03
## 24       24 12005075 04030204000002           34.8899 1.855466e-03
## 25       25 12007779 04030204000011            8.4184 5.780138e-05
## 26       26 12006759 04030204000003           17.2322 4.991553e-04
## 27       27 12006783 04030204000019           77.0059 2.856606e-04
GreenBay_FoxRiver_sf_locs <- dplyr::bind_cols(GreenBay_FoxRiver_sf_locs, linear_references)

We can take this one step further by indexing points to waterbodies! The return here tells us which waterbody our locations are near or within. For on-network waterbodies, it also includes the outlet flowline for each waterbody.

all_wb <- dplyr::bind_rows(dplyr::select(nhdplus_data$NHDWaterbody, wbid = comid),
                           dplyr::select(nhdplus_data$NHDArea, wbid = comid))

(waterbody_indexes <- hydroloom::index_points_to_waterbodies(
  sf::st_transform(all_wb, 5070), 
  GreenBay_FoxRiver_sf_locs, 
  flines = nhdplus_data$NHDFlowline_Network, 
  search_radius = units::as_units(1000, "m")))
## # A tibble: 27 × 4
##    near_wbid near_wb_dist   in_wbid wb_outlet_id
##        <int>        <dbl>     <int>        <int>
##  1 120049201         23.5 120049201     12005077
##  2 120049201         84.5 120049201     12005077
##  3 120049201         13.3 120049201     12005077
##  4 120049201        158.  120049201     12005077
##  5 120049201         80.6 120049201     12005077
##  6 120049201         82.0 120049201     12005077
##  7 120049201        120.  120049201     12005077
##  8 120049201         29.4 120049201     12005077
##  9 120049201        109.  120049201     12005077
## 10 120049201         77.8 120049201     12005077
## # ℹ 17 more rows
par(mar=c(0,0,0,0))
nhdplusTools::plot_nhdplus(bbox = sf::st_bbox(GreenBay_FoxRiver_sf), 
                           cache_data = tempfile(fileext = ".rds"))
plot(sf::st_transform(all_wb[all_wb$wbid %in% waterbody_indexes$near_wbid,], 
                      3857),
     add = TRUE, 
     col = "darkblue", border = NA)
plot(sf::st_transform(sf::st_geometry(GreenBay_FoxRiver_sf_locs), 3857), add = TRUE, col = "white")

There’s much, much more where that came from. See the pkgdown sites for nhdplusTools and hydroloom for more!

Accessing watershed information for sites

We can access watershed information for each unique site location using StreamCatTools, retrieving landscape data for either the local catchment or the full upstream watershed of each site.

Discover what StreamCat metrics we might want to use

metrics <- StreamCatTools::sc_get_params(param = 'metric_names')
print(paste0('A selection of available StreamCat metrics include: ',paste(metrics[1:10],collapse = ', ')))
## [1] "A selection of available StreamCat metrics include: agkffact, al2o3, bankfulldepth, bankfullwidth, bfi, canaldens, cao, cbnf, chem, clay"

Discover land cover of watersheds for sites

We’ll pull in all the NLCD categories at the local-catchment level for each location.

GB_FR_NLCD <- StreamCatTools::sc_nlcd(year = '2019', aoi = 'cat', comid = GreenBay_FoxRiver_sf_locs$COMID)


GB_FR_Urb <- GB_FR_NLCD |>
  dplyr::mutate(Pct_Urbanized = pcturbop2019cat + pcturbmd2019cat +
    pcturblo2019cat + pcturbhi2019cat) |>
  dplyr::select(comid, Pct_Urbanized)
GB_FR_Urb
##       comid Pct_Urbanized
## 1  12005073         88.66
## 2  12005075         95.87
## 3  12005077         93.88
## 4  12006757         82.40
## 5  12006759         62.50
## 6  12006765         73.76
## 7  12006773         75.87
## 8  12006783         14.95
## 9  12006797         27.37
## 10 12007779         49.64

Visualize urbanization for local catchment for each location

ggplot2::ggplot(GB_FR_Urb, ggplot2::aes(x = Pct_Urbanized)) +
  ggplot2::geom_density()
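
Because subset_nhdplus already gave us the catchment polygons (and featureid matches COMID, as we confirmed earlier), we can also sketch a quick map of the same metric. This join-and-plot snippet is an optional illustration, not part of the original workshop code:

# Join the StreamCat results to the catchment polygons downloaded earlier
# (CatchmentSP$featureid corresponds to COMID), then map the metric
urb_cat <- dplyr::inner_join(nhdplus_data$CatchmentSP, GB_FR_Urb,
                             by = c("featureid" = "comid"))
plot(urb_cat["Pct_Urbanized"], main = "Percent urbanized (NLCD 2019)")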

Pull in data for modeling using StreamCatTools

Now we’ll demonstrate pulling in watershed data that we might use as spatial covariates in a modeling exercise.

ws_data <- StreamCatTools::sc_get_data(metric = 'fert,nsurp,nani,manure,IWI', aoi = 'cat,ws', comid = GreenBay_FoxRiver_sf_locs$COMID)
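
sc_get_data returns a plain data frame with one row per COMID, with columns named using StreamCat’s metric-plus-AOI suffix convention (catchment vs. watershed versions of each metric). A quick inspection:

# Peek at the covariate table returned by StreamCat (one row per COMID)
names(ws_data)
head(ws_data)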

Building Statistical Models

Spatial Dependence

  • For spatial data, nearby observations tend to be more similar than distant observations

  • This phenomenon is called spatial dependence and can be built into statistical models

  • The benefits of incorporating spatial dependence are significant and include:

    • More realistic characterization of ecological drivers
    • More precise predictions at unobserved locations

The spmodel R package

  • The spmodel R package makes spatial models accessible via straightforward extensions to common modeling functions like lm() and glm()
  • Spatial dependence is based on Euclidean (straight-line) distance
  • Learn more at https://usepa.github.io/spmodel/
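
To make the bullets above concrete, here is a minimal model fit adapted from the spmodel documentation, using the moss dataset that ships with the package (not our Green Bay data):

# Fit a spatial linear model with an exponential spatial covariance
# (adapted from the spmodel documentation's moss example)
library(spmodel)
spmod <- splm(log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential")
summary(spmod)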

The SSN2 R package

  • Like spmodel, SSN2 extends common modeling functions like lm() and glm()
  • Spatial dependence is based on stream network distance (flow-connected, flow-unconnected)
  • SSN2 is an updated version of SSN (SSN has been archived)
  • Learn more at https://usepa.github.io/SSN2/
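
Likewise, a minimal sketch using the Middle Fork example data that ships with SSN2 (stream-network models require a .ssn object bundling sites, flowlines, and network topology):

# Fit a stream-network model on the SSN2 Middle Fork example data
library(SSN2)
copy_lsn_to_temp() # copies the example .ssn folder to tempdir()
mf04p <- ssn_import(paste0(tempdir(), "/MiddleFork04.ssn"), predpts = "pred1km")
ssn_mod <- ssn_lm(Summer_mn ~ ELEV_DEM,
  ssn.object = mf04p,
  tailup_type = "exponential",
  additive = "afvArea"
)
summary(ssn_mod)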

TADA Module 2: Geospatial Functions

Additional functions and a more detailed example workflow are available here: https://usepa.github.io/EPATADA/articles/TADAModule2.html

CWA Assessment Process

We do not have time to cover the full process today. Let’s focus on geospatial aspects!

See the Integrated Reporting Memoranda under CWA Sections 303(d), 305(b), and 314.

What are Assessment Units?

Geospatial areas for analysis. Let’s assign data to those units!

CWA assessment determinations are made by assessment unit, meaning the entire assessment unit is assessed as either meeting or not meeting water quality standards (i.e., thresholds or criteria) for all designated uses.

How are assessment units delineated?

Assessment units are typically delineated as watershed-oriented collections of stream reaches, often broken at physical features like waterfalls, bridge crossings, or changes in land use. Delineating units this way supports analysis of water quality impairments within a specific area and helps ensure data homogeneity and spatial clarity within each assessment unit.

  • Existing Assessment Units are available from ATTAINS geospatial services

Associating ATTAINS Assessment Units with WQP Monitoring Locations

One of the first steps in the CWA assessment process is to define Assessment Units and associate data with them. A major source for water quality data is the WQP.


  • Assessment Units: state or tribal waterbody geospatial features
    • These may be lines, areas or points
  • Water Quality Portal Monitoring Locations
    • These are points

TADA_GetATTAINS() Part A

  • Automates matching of WQP monitoring locations with ATTAINS assessment units that fall within (intersect) the same NHDPlus catchment
  • The function uses high-resolution NHDPlus catchments by default because 80% of state-submitted assessment units in ATTAINS were developed based on high-res NHD; users can select medium-res if applicable to their use case
WQP_with_ATTAINSonly <- TADA_GetATTAINS(GreenBay_FoxRiver_Subset, fill_catchments = FALSE, return_sf = TRUE)

TADA_ViewATTAINS(WQP_with_ATTAINSonly)

Challenges with Automated Approach

  • Certain NHDPlus high-res catchments overlap multiple ATTAINS assessment units (state-submitted hydrography), so the current functions assign those sites to both AUs
  • WQP site locations are not always accurate (imprecise coordinates)
  • WQP location metadata may also help with matching/QAQC’ing waterbody names against ATTAINS waterbody names, rather than relying solely on lat/long and geospatial/mapping information
  • Users must manually review associations for accuracy

Using all available data from the WQP

Finally, some waterbodies have data available in the WQP or from other sources, but there are no existing Assessment Units in ATTAINS for them.

TADA_GetATTAINS() Part B

See: nhdplusTools, TADA::fetchNHD(), TADA_GetATTAINS(), TADA_ViewATTAINS()

Creating new AUs to assess additional areas (leveraging USGS’s nhdplusTools and TADA geospatial functions)

TADA includes a way to explore this via the fill_catchments argument of TADA_GetATTAINS(). It is included for exploratory purposes only. In theory, states and tribal nations could use the high-res catchments as new assessment unit polygons to assess additional areas where there is WQP data but no Assessment Unit yet in ATTAINS, but that process is outside of TADA.


For WQP monitoring sites that DO NOT overlap an existing ATTAINS feature (neither the ATTAINS NHDPlus high-res catchment snapshot nor state-submitted points/lines/polygons), there is nothing to use from ATTAINS; these are areas where there is WQP data but no ATTAINS Assessment Unit yet.


For these, we implemented a solution using nhdplusTools to pull in either NHDPlus high-res or medium-res catchments (the user can choose, but high-res is the default), match those with the WQP sites, and create new IDs (essentially creating new AUs from the catchments that intersect these WQP sites).

WQP_withATTAINSandNHDPluscatchments <- TADA_GetATTAINS(GreenBay_FoxRiver_Subset, fill_catchments = TRUE, return_sf = TRUE)

TADA_ViewATTAINS(WQP_withATTAINSandNHDPluscatchments)

The end!

end.time <- Sys.time()

end.time - start.time
## Time difference of 1.401874 mins

Contribute

Note: TADA is still under development. New functionality is added weekly, and sometimes we need to make bug fixes in response to tester and user feedback. We appreciate your feedback, patience, and interest in these helpful tools.

If you are interested in contributing to TADA development, more information is available at:

https://usepa.github.io/EPATADA/articles/CONTRIBUTING.html

We welcome collaboration with external partners.

Contribute to EPATADA in a way that helps elevate work you have already done, broadens the user base of the package, or improves the resource for all!

Thank you to our workshop contributors!

  • EPA: Cristina Mullin (mullin.cristina@epa.gov), Marc Weber, Hillary Marler, Kenny Wong, Michael Dumelle, Shelly Thawley

  • USGS: Dave Blodgett