Geospatial Data Integration
2025-05-07
Source: vignettes/GeospatialDataIntegration.Rmd
Overview
Let’s walk through how to create an efficient and reproducible workflow that integrates several R packages developed by the U.S. Environmental Protection Agency (EPA) and the U.S. Geological Survey (USGS) to support water quality programs (such as the Clean Water Act) and geospatial (watershed or waterbody) level analyses.
This workflow demonstrates potential uses (beyond their original collection purpose) for publicly available water quality data from the Water Quality Portal (WQP). To start, participants will learn how to use EPA’s Tools for Automated Data Analysis (TADA) R package to retrieve, wrangle, harmonize, quality check, visualize, and analyze WQP data from multiple organizations.
Next, we will showcase how to bring in other web services and libraries for easy integration of additional hydrologic and geospatial data. We then plan to touch briefly on packages that can assist with building statistical models. Finally, we will demonstrate an example of analyzing water quality by Assessment Units (AUs), which are state or tribal nation defined watershed or waterbody areas used for CWA assessments and reporting water quality conditions to EPA and the public.
Intended Audience
Water Quality eXchange (WQX) and Water Quality Portal (WQP) community, Clean Water Act (CWA) community (EPA, States and Tribal Nations), water quality and geospatial data analysts/researchers, EPA/USGS and other federal agencies.
Leveraged R Packages
EPA: EPATADA, StreamCatTools, spsurvey, spmodel, SSN2
USGS: dataRetrieval, nhdplusTools, hydroloom
Fundamental geospatial packages: sf, prism, terra, leaflet and tmap
Install and load packages
We will be leveraging the EPATADA R package for WQP data retrieval, cleaning, visualization, and other steps needed to prepare for analysis. Let’s dive into Green Bay, WI!
First, install and load the remotes package, specifying the repo. This is needed before installing EPATADA because it is only available on GitHub (not CRAN).
install.packages("remotes",
  repos = "http://cran.us.r-project.org"
)
library(remotes)
Next, install and load EPATADA using the remotes package. All other dependencies for this workflow will be downloaded automatically. We will also need to install StreamCatTools for some steps later in the workflow.
remotes::install_github("USEPA/EPATADA",
  ref = "develop",
  dependencies = TRUE,
  force = TRUE
)
remotes::install_github("USEPA/StreamCatTools",
  ref = "master",
  dependencies = TRUE,
  force = TRUE
)
library(EPATADA)
library(StreamCatTools)
It’s go time! Let’s time our process.
# Record start time
start.time <- Sys.time()
WQP data discovery and cleaning
This is an abbreviated introduction to key TADA Module 1 WQP Data Discovery and Cleaning functions. Additional functions and a more detailed example workflow are available here.
Retrieve data from the WQP
In this example, we will first use EPA’s How’s My Waterway (HMW) application to find an applicable Hydrologic Unit Code (HUC) for our area of interest: the Fox River, Green Bay, WI. Next, let’s query the WQP using the identified HUC, state abbreviation, and a date range. We’ll start by pulling all data available in the WQP for this HUC12 in Wisconsin from 2015 through 2024.
WATERSHED: City of Green Bay-Fox River (040302040405)
# Query the WQP (or load the cached workshop data below)
GreenBay_FoxRiver <- TADA_DataRetrieval(
  statecode = "WI",
  startDate = "2015-01-01",
  endDate = "2024-12-30",
  huc = c("040302040405"),
  applyautoclean = TRUE,
  ask = FALSE
)
## [1] "Downloading WQP query results. This may take some time depending upon the query size."
## $statecode
## [1] "US:55"
##
## $huc
## [1] "040302040405"
##
## $startDate
## [1] "2015-01-01"
##
## $endDate
## [1] "2024-12-30"
##
## [1] "Data successfully downloaded. Running TADA_AutoClean function."
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "200 results in your dataset have one of the following deprecated characteristic names: 1-Octanesulfonic acid, 3,3,4,4,5,5,6,6,7,7,8,8,8-tridecafluoro-; 1,2-Benzenedicarboxamide, N2-[1,1-dimethyl-2-(methylsulfonyl)ethyl]-3-iodo-N1-[2-methyl-4-[1,2,2,2-tetrafluoro-1-(trifluoromethyl)ethyl]phenyl]-; Decabromodiphenyl ether; Desisopropyl atrazine; Inorganic nitrogen (nitrate and nitrite); Inorganic nitrogen (nitrate and nitrite) ***retired***use Nitrate + Nitrite; N-ethyl Perfluorooctane sulfonamide. These names have been substituted with the updated preferred names in the TADA.CharacteristicName field."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"
# GreenBay_FoxRiver <- NMCWorkshopData::GreenBay_FoxRiver
Wrangle
Now, let’s use EPATADA functions to review, visualize, and whittle the returned WQP data down to include only results that are applicable to our water quality analysis and area of interest.
Autoclean
In the default arguments for TADA_DataRetrieval, applyautoclean = TRUE. This runs TADA_AutoClean on the newly retrieved data frame. TADA_AutoClean is a powerful function that performs a variety of tasks, including: (1) creating new “TADA”-prefixed columns and capitalizing their contents to reduce case sensitivity issues, (2) converting special characters in value columns, (3) converting latitude and longitude values to numeric, (4) replacing “meters” with “m”, (5) replacing deprecated characteristic names with current WQX names, (6) harmonizing result and detection limit units to WQX, TADA, or user supplied target units, (7) converting depths to meters, and (8) creating the TADA.ComparableDataIdentifier column by concatenating characteristic name, result sample fraction, method speciation, and result measure unit.
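If you retrieve data with applyautoclean = FALSE, or bring WQP results in some other way, you can run the same cleaning step yourself. A minimal sketch, reusing the query parameters from above:
# Retrieve raw WQP data without automatic cleaning, then run
# TADA_AutoClean manually to create the TADA-prefixed columns
raw <- TADA_DataRetrieval(
  statecode = "WI",
  startDate = "2015-01-01",
  endDate = "2024-12-30",
  huc = c("040302040405"),
  applyautoclean = FALSE,
  ask = FALSE
)
cleaned <- TADA_AutoClean(raw)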
Flag and remove duplicate results from a single organization
We can identify data records uploaded by the same organization with the same date, time, monitoring location, activity type, characteristic name, fraction, taxonomic name, depth columns, and result value, and flag them as potential duplicates. The data user must determine whether the records are unique or represent overlap that could cause issues in analysis. For this example, we will retain only results flagged as “Unique”.
# find duplicate results submitted by single org
GreenBay_FoxRiver <- TADA_FindPotentialDuplicatesSingleOrg(GreenBay_FoxRiver)
## [1] "TADA_FindPotentialDuplicatesSingleOrg: 117 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
# retain unique flagged results
GreenBay_FoxRiver <- dplyr::filter(GreenBay_FoxRiver, TADA.SingleOrgDup.Flag == "Unique")
Censored results
TADA provides some simple methods for dealing with censored results, such as multiplying the detection limit by a user supplied value or leaving the result as is.
# substitute nondetects with 0.5 * the detection limit; leave overdetects as is
GreenBay_FoxRiver <- TADA_SimpleCensoredMethods(GreenBay_FoxRiver,
  nd_method = "multiplier",
  nd_multiplier = 0.5,
  od_method = "as-is",
  od_multiplier = "null"
)
## [1] "Dataframe does not include any information (all NA's) in MeasureQualifierCode."
## [1] "TADA_IDCensoredData: There are 9 results in your dataframe that are missing ResultDetectionConditionText. TADA requires BOTH ResultDetectionConditionText and DetectionQuantitationLimitTypeName fields to be populated in order to categorize censored data."
Flag and remove duplicates from multiple organizations
Two organizations sometimes submit the exact same data to the WQP. Filtering out these duplicates can prevent issues in analysis.
# find potential dups multiple orgs
GreenBay_FoxRiver <- TADA_FindPotentialDuplicatesMultipleOrgs(GreenBay_FoxRiver)
# filter out
GreenBay_FoxRiver <- dplyr::filter(GreenBay_FoxRiver, TADA.ResultSelectedMultipleOrgs == "Y")
Filter out any remaining irrelevant data, NA’s and empty columns
unique(GreenBay_FoxRiver$TADA.ResultMeasureValueDataTypes.Flag)
## [1] "Numeric"
## [2] "Result Value/Unit Estimated from Detection Limit"
## [3] "Result Value/Unit Copied from Detection Limit"
## [4] "Greater Than"
## [5] "Text"
## [6] "Coerced to NA"
sum(is.na(GreenBay_FoxRiver$TADA.ResultMeasureValue))
## [1] 63
GreenBay_FoxRiver <- TADA_AutoFilter(GreenBay_FoxRiver)
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
## [1] "TADA_Autofilter: removing columns not required for TADA workflow if they contain only NAs."
## [1] "The following column(s) were removed as they contained only NAs: ActivityDepthAltitudeReferencePointText, SampleAquifer, BinaryObjectFileName, BinaryObjectFileTypeCode, LabSamplePreparationUrl, FormationTypeText, ProjectMonitoringLocationWeightingUrl, ContributingDrainageAreaMeasure.MeasureValue and ContributingDrainageAreaMeasure.MeasureUnitCode."
## [1] "TADA_Autofilter: checking required columns for non-NA values."
## [1] "TADA Required column(s) ResultDepthAltitudeReferencePointText, ActivityRelativeDepthName, ActivityTopDepthHeightMeasure.MeasureValue, TADA.ActivityTopDepthHeightMeasure.MeasureValue, ActivityTopDepthHeightMeasure.MeasureUnitCode, TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode, ActivityBottomDepthHeightMeasure.MeasureValue, TADA.ActivityBottomDepthHeightMeasure.MeasureValue, ActivityBottomDepthHeightMeasure.MeasureUnitCode, TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode, ResultTimeBasisText, ResultFileUrl, ResultAnalyticalMethod.MethodUrl, SampleCollectionMethod.MethodDescriptionText, MeasureQualifierCode, DataQuality.PrecisionValue, DataQuality.BiasValue, DataQuality.ConfidenceIntervalValue, DataQuality.UpperConfidenceLimitValue, DataQuality.LowerConfidenceLimitValue, SamplingDesignTypeCode, ProjectFileUrl, QAPPApprovedIndicator, QAPPApprovalAgencyName, AquiferName, AquiferTypeName, LocalAqfrName, ConstructionDateText, WellDepthMeasure.MeasureValue, WellDepthMeasure.MeasureUnitCode, WellHoleDepthMeasure.MeasureValue and WellHoleDepthMeasure.MeasureUnitCode contain only NA values. This may impact other TADA functions."
## [1] "Function removed 65 results. These results are either text or NA and cannot be plotted or represent quality control activities (not routine samples or measurements)."
Check to make sure there are no more NA’s in TADA.ResultMeasureValue.
unique(GreenBay_FoxRiver$TADA.ResultMeasureValueDataTypes.Flag)
## [1] "Numeric"
## [2] "Result Value/Unit Estimated from Detection Limit"
## [3] "Result Value/Unit Copied from Detection Limit"
## [4] "Greater Than"
sum(is.na(GreenBay_FoxRiver$TADA.ResultMeasureValue))
## [1] 0
Flag and remove QAQC samples and suspect results
GreenBay_FoxRiver <- TADA_RunKeyFlagFunctions(GreenBay_FoxRiver, clean = TRUE)
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
## [1] "Dataframe does not include any information (all NA's) in MeasureQualifierCode."
Flag results above and below threshold, but do not remove them
GreenBay_FoxRiver <- TADA_FlagAboveThreshold(GreenBay_FoxRiver, clean = FALSE, flaggedonly = FALSE)
GreenBay_FoxRiver <- TADA_FlagBelowThreshold(GreenBay_FoxRiver, clean = FALSE, flaggedonly = FALSE)
Harmonize synonyms across characteristic, fraction, and speciation
GreenBay_FoxRiver <- TADA_HarmonizeSynonyms(GreenBay_FoxRiver)
Calculate Total N and Total P from various species and fractions
GreenBay_FoxRiver <- TADA_CalculateTotalNP(GreenBay_FoxRiver, daily_agg = "max")
## [1] "Aggregation results:"
##
## No aggregation needed Selected as max aggregate value
## 11204 874
Review unique characteristic, fraction, and species combinations
GreenBay_FoxRiver_Counts <- TADA_FieldValuesTable(GreenBay_FoxRiver, field = "TADA.ComparableDataIdentifier")
DT::datatable(GreenBay_FoxRiver_Counts, fillContainer = TRUE)
Filter to focus on frequently monitored characteristics in example data
GreenBay_FoxRiver_Subset <- GreenBay_FoxRiver %>%
  dplyr::filter(TADA.ComparableDataIdentifier %in% c(
    "SPECIFIC CONDUCTANCE_NA_NA_US/CM",
    "PH_NA_NA_NA",
    "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L",
    "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L",
    "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"
  ))
Review organizations for subset
# Create pie of results by organization
TADA_FieldValuesPie(GreenBay_FoxRiver_Subset, field = "OrganizationFormalName")
Exploratory visualizations
Generate stats table
GreenBay_FoxRiver_Subset_Stats <- TADA_Stats(GreenBay_FoxRiver_Subset)
## [1] "Note: Your dataset contains TADA-generated total nutrient results, which have fewer columns populated with metadata. This might affect how groups are displayed in the stats table."
DT::datatable(GreenBay_FoxRiver_Subset_Stats, fillContainer = TRUE)
Generate scatterplot
TADA_TwoCharacteristicScatterplot(GreenBay_FoxRiver_Subset, id_cols = "TADA.ComparableDataIdentifier", groups = c("TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L", "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"))
Generate map
TADA_OverviewMap(GreenBay_FoxRiver_Subset)
Clean up coordinate issues
# Change coordinate sign if appropriate; assign back to the subset so the
# corrected coordinates are used in the rest of the workflow
GreenBay_FoxRiver_Subset <- TADA_FlagCoordinates(GreenBay_FoxRiver_Subset,
  clean_outsideUSA = "change sign",
  clean_imprecise = FALSE
)
## [1] "When clean_outsideUSA == change sign, the sign for any lat/long coordinates flagged as outside of USA are switched. This is a temporary solution. Data owners should fix the raw data to address Suspect coordinates through WQX. For assistance fixing data errors you see in the WQP, email the WQX helpdesk (WQX@epa.gov)."
# This df has NA lons from USGS that must be addressed before TADA_MakeSpatial can be run...
sum(is.na(GreenBay_FoxRiver_Subset$LongitudeMeasure))
## [1] 601
# Remove rows with NA lons from df
GreenBay_FoxRiver_Subset <- GreenBay_FoxRiver_Subset[!is.na(GreenBay_FoxRiver_Subset$LongitudeMeasure),]
# Recheck
sum(is.na(GreenBay_FoxRiver_Subset$LongitudeMeasure))
## [1] 0
Geospatial data integration
Make spatial
Leverage TADA_MakeSpatial to transform a WQP dataframe into a geospatial sf object.
GreenBay_FoxRiver_sf = TADA_MakeSpatial(GreenBay_FoxRiver_Subset)
Then create a unique identifier based on shared lat long values and filter to just the 25 unique locations.
GreenBay_FoxRiver_sf$latlon <- paste0(GreenBay_FoxRiver_sf$TADA.LongitudeMeasure,
                                      GreenBay_FoxRiver_sf$TADA.LatitudeMeasure)
GreenBay_FoxRiver_sf <- GreenBay_FoxRiver_sf |>
  dplyr::group_by(latlon) |>
  dplyr::mutate(loc_id = dplyr::cur_group_id())
GreenBay_FoxRiver_sf_locs <- GreenBay_FoxRiver_sf |>
  dplyr::filter(!duplicated(loc_id))
Access NHDPlus COMIDs for sites
We use the StreamCatTools function sc_get_comid (which uses an nhdplusTools web service client) to get the COMID for each location.
GreenBay_FoxRiver_sf_locs$COMID <- as.integer(
  strsplit(StreamCatTools::sc_get_comid(GreenBay_FoxRiver_sf_locs), split = ",")[[1]]
)
nhdplus_data <- nhdplusTools::subset_nhdplus(GreenBay_FoxRiver_sf_locs$COMID, nhdplus_data = "download")
outlet <- dplyr::filter(nhdplus_data$NHDFlowline_Network, hydroseq == min(hydroseq))
nhdplusTools::plot_nhdplus(bbox = sf::st_bbox(outlet))
plot(sf::st_transform(sf::st_geometry(GreenBay_FoxRiver_sf_locs), 3857), add = TRUE)
dataRetrieval/NLDI, nhdplusTools, hydroloom
Do a network navigation and get NHDPlus data for our sites. Note that the network navigation only includes flowline geometry; nhdplusTools subsets all of the NHDPlus layers.
all_network <- dataRetrieval::findNLDI(comid = outlet$comid, nav = "UT", distance_km = 500)
# we could select only comids on network
if (FALSE) { # don't run this one
  nhdplus_data <- nhdplusTools::subset_nhdplus(
    comids = as.integer(all_network$UT_flowlines$nhdplus_comid),
    nhdplus_data = "download",
    flowline_only = FALSE
  )
}
# or we could just get everything in the bbox to be sure we get non-network stuff too!
nhdplus_data <- nhdplusTools::subset_nhdplus(
  bbox = sf::st_bbox(all_network$UT_flowlines),
  nhdplus_data = "download",
  flowline_only = FALSE
)
# see ?nhdplusTools::subset_nhdplus for lots more options!
sapply(nhdplus_data, nrow)
## CatchmentSP NHDFlowline_Network NHDArea
## 10522 10599 104
## NHDWaterbody NHDFlowline_NonNetwork
## 3988 410
sapply(nhdplus_data, names)
## $CatchmentSP
## [1] "gridcode" "featureid" "sourcefc" "areasqkm" "shape_length"
## [6] "shape_area" "geometry"
##
## $NHDFlowline_Network
## [1] "comid" "fdate" "resolution" "gnis_id" "gnis_name"
## [6] "lengthkm" "reachcode" "flowdir" "wbareacomi" "ftype"
## [11] "fcode" "shape_length" "streamleve" "streamorde" "streamcalc"
## [16] "fromnode" "tonode" "hydroseq" "levelpathi" "pathlength"
## [21] "terminalpa" "arbolatesu" "divergence" "startflag" "terminalfl"
## [26] "dnlevel" "uplevelpat" "uphydroseq" "dnlevelpat" "dnminorhyd"
## [31] "dndraincou" "dnhydroseq" "frommeas" "tomeas" "rtndiv"
## [36] "vpuin" "vpuout" "areasqkm" "totdasqkm" "divdasqkm"
## [41] "tidal" "totma" "wbareatype" "pathtimema" "hwnodesqkm"
## [46] "maxelevraw" "minelevraw" "maxelevsmo" "minelevsmo" "slope"
## [51] "elevfixed" "hwtype" "slopelenkm" "qa_ma" "va_ma"
## [56] "qc_ma" "vc_ma" "qe_ma" "ve_ma" "qa_01"
## [61] "va_01" "qc_01" "vc_01" "qe_01" "ve_01"
## [66] "qa_02" "va_02" "qc_02" "vc_02" "qe_02"
## [71] "ve_02" "qa_03" "va_03" "qc_03" "vc_03"
## [76] "qe_03" "ve_03" "qa_04" "va_04" "qc_04"
## [81] "vc_04" "qe_04" "ve_04" "qa_05" "va_05"
## [86] "qc_05" "vc_05" "qe_05" "ve_05" "qa_06"
## [91] "va_06" "qc_06" "vc_06" "qe_06" "ve_06"
## [96] "qa_07" "va_07" "qc_07" "vc_07" "qe_07"
## [101] "ve_07" "qa_08" "va_08" "qc_08" "vc_08"
## [106] "qe_08" "ve_08" "qa_09" "va_09" "qc_09"
## [111] "vc_09" "qe_09" "ve_09" "qa_10" "va_10"
## [116] "qc_10" "vc_10" "qe_10" "ve_10" "qa_11"
## [121] "va_11" "qc_11" "vc_11" "qe_11" "ve_11"
## [126] "qa_12" "va_12" "qc_12" "vc_12" "qe_12"
## [131] "ve_12" "lakefract" "surfarea" "rareahload" "rpuid"
## [136] "vpuid" "enabled" "geometry"
##
## $NHDArea
## [1] "comid" "fdate" "resolution" "gnis_id" "gnis_name"
## [6] "areasqkm" "elevation" "ftype" "fcode" "shape_length"
## [11] "shape_area" "onoffnet" "purpcode" "purpdesc" "geometry"
##
## $NHDWaterbody
## [1] "comid" "fdate" "resolution" "gnis_id" "gnis_name"
## [6] "areasqkm" "elevation" "reachcode" "ftype" "fcode"
## [11] "shape_length" "shape_area" "onoffnet" "purpcode" "purpdesc"
## [16] "meandepth" "lakevolume" "maxdepth" "meandused" "meandcode"
## [21] "lakearea" "geometry"
##
## $NHDFlowline_NonNetwork
## [1] "comid" "fdate" "resolution" "gnis_id" "gnis_name"
## [6] "lengthkm" "reachcode" "flowdir" "wbareacomi" "ftype"
## [11] "fcode" "shape_length" "geometry"
Addressing sites to the network
There are two forms of hydrographic addresses: catchment indexing and linear referencing. The former is established with a point-in-polygon analysis; the latter is more nuanced. The following block shows how to establish both with the data we just retrieved.
Note that hydroloom is compatible with nhdplus and other attribute systems. See hydroloom documentation for more!
GreenBay_FoxRiver_sf_locs <- sf::st_join(
  GreenBay_FoxRiver_sf_locs,
  hydroloom::st_compatibalize(
    dplyr::select(nhdplus_data$CatchmentSP, featureid),
    GreenBay_FoxRiver_sf_locs
  )
)
# NOTE that featureid and comid are the same!!
all(GreenBay_FoxRiver_sf_locs$COMID == GreenBay_FoxRiver_sf_locs$featureid)
## [1] TRUE
(linear_references <- hydroloom::index_points_to_lines(
nhdplus_data$NHDFlowline_Network,
GreenBay_FoxRiver_sf_locs))
## point_id comid reachcode reachcode_measure offset
## 1 1 12005077 04030204000430 61.0183 7.977389e-04
## 2 2 12006773 04030204000019 20.5596 1.177518e-03
## 3 3 6802196 04030204002264 93.7429 2.599949e-03
## 4 4 12005077 04030204000430 61.0183 4.828450e-03
## 5 5 12005073 04030204000060 0.0000 1.837381e-03
## 6 6 12005075 04030204000002 34.8899 7.484437e-04
## 7 7 12006773 04030204000019 18.5164 2.579015e-03
## 8 8 12006797 04030204000036 97.5760 5.627477e-04
## 9 9 6802194 04030204002264 91.7988 4.304105e-03
## 10 10 12005077 04030204000430 61.0183 3.097056e-03
## 11 11 12006773 04030204000019 15.2004 3.426734e-03
## 12 12 12005075 04030204000002 1.2704 4.822365e-04
## 13 13 12005075 04030204000002 22.5444 2.719191e-04
## 14 14 12006757 04030204000002 59.0151 8.358642e-04
## 15 15 12006759 04030204000003 0.0000 9.384823e-04
## 16 16 12006765 04030204000010 67.6320 2.194561e-04
## 17 17 12006773 04030204000019 3.6289 1.029342e-03
## 18 18 12006773 04030204000019 8.0769 2.663165e-03
## 19 19 12005077 04030204000430 61.0183 1.690398e-03
## 20 20 12005073 04030204000060 8.4000 7.281552e-04
## 21 21 12006773 04030204000019 15.2004 3.002152e-03
## 22 22 12006757 04030204000002 85.9012 9.870182e-04
## 23 23 12005077 04030204000430 61.0183 4.628034e-03
## 24 24 12005075 04030204000002 34.8899 1.855466e-03
## 25 25 12007779 04030204000011 8.4184 5.780138e-05
## 26 26 12006759 04030204000003 17.2322 4.991553e-04
## 27 27 12006783 04030204000019 77.0059 2.856606e-04
GreenBay_FoxRiver_sf_locs <- dplyr::bind_cols(GreenBay_FoxRiver_sf_locs, linear_references)
We can take this one step further by indexing points to waterbodies! The return here tells us what waterbody our locations are near or within. For on-network waterbodies, it will also include the outlet flowline for each waterbody.
all_wb <- dplyr::bind_rows(
  dplyr::select(nhdplus_data$NHDWaterbody, wbid = comid),
  dplyr::select(nhdplus_data$NHDArea, wbid = comid)
)
(waterbody_indexes <- hydroloom::index_points_to_waterbodies(
  sf::st_transform(all_wb, 5070),
  GreenBay_FoxRiver_sf_locs,
  flines = nhdplus_data$NHDFlowline_Network,
  search_radius = units::as_units(1000, "m")
))
## # A tibble: 27 × 4
## near_wbid near_wb_dist in_wbid wb_outlet_id
## <int> <dbl> <int> <int>
## 1 120049201 23.5 120049201 12005077
## 2 120049201 84.5 120049201 12005077
## 3 120049201 13.3 120049201 12005077
## 4 120049201 158. 120049201 12005077
## 5 120049201 80.6 120049201 12005077
## 6 120049201 82.0 120049201 12005077
## 7 120049201 120. 120049201 12005077
## 8 120049201 29.4 120049201 12005077
## 9 120049201 109. 120049201 12005077
## 10 120049201 77.8 120049201 12005077
## # ℹ 17 more rows
par(mar = c(0, 0, 0, 0))
nhdplusTools::plot_nhdplus(
  bbox = sf::st_bbox(GreenBay_FoxRiver_sf),
  cache_data = tempfile(fileext = ".rds")
)
plot(sf::st_transform(all_wb[all_wb$wbid %in% waterbody_indexes$near_wbid, ], 3857),
  add = TRUE, col = "darkblue", border = NA
)
plot(sf::st_transform(sf::st_geometry(GreenBay_FoxRiver_sf_locs), 3857),
  add = TRUE, col = "white"
)
There’s much, much more where that came from. See the pkgdown sites for nhdplusTools and hydroloom for more!
Accessing watershed information for sites
We can access watershed information for each unique site location using StreamCatTools, getting landscape data for either the local catchment or the full upstream watershed of each site.
Discover what StreamCat metrics we might want to use
metrics <- StreamCatTools::sc_get_params(param = 'metric_names')
print(paste0('A selection of available StreamCat metrics include: ',paste(metrics[1:10],collapse = ', ')))
## [1] "A selection of available StreamCat metrics include: agkffact, al2o3, bankfulldepth, bankfullwidth, bfi, canaldens, cao, cbnf, chem, clay"
Discover land cover of watersheds for sites
We’ll pull in all the NLCD categories at the local catchment level for each location.
GB_FR_NLCD <- StreamCatTools::sc_nlcd(year='2019', aoi='cat', comid=GreenBay_FoxRiver_sf_locs$COMID)
GB_FR_Urb <- GB_FR_NLCD |>
  dplyr::mutate(Pct_Urbanized = pcturbop2019cat + pcturbmd2019cat +
    pcturblo2019cat + pcturbhi2019cat) |>
  dplyr::select(comid, Pct_Urbanized)
GB_FR_Urb
## comid Pct_Urbanized
## 1 12005073 88.66
## 2 12005075 95.87
## 3 12005077 93.88
## 4 12006757 82.40
## 5 12006759 62.50
## 6 12006765 73.76
## 7 12006773 75.87
## 8 12006783 14.95
## 9 12006797 27.37
## 10 12007779 49.64
Visualize urbanization for local catchment for each location
ggplot2::ggplot(GB_FR_Urb, ggplot2::aes(x = Pct_Urbanized)) +
  ggplot2::geom_density()
Pull in data for modeling using StreamCatTools
Now we’ll demonstrate pulling in watershed data that we might use as spatial covariates in a modeling exercise.
ws_data <- StreamCatTools::sc_get_data(metric='fert,nsurp,nani,manure,IWI', aoi='cat,ws', comid=GreenBay_FoxRiver_sf_locs$COMID)
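As a hedged sketch of how these covariates could be attached to the site locations for modeling (sc_get_data() returns one row per COMID, and column names follow StreamCat’s metric plus area-of-interest naming convention, as in the sc_nlcd() output above):
# Join StreamCat watershed covariates to the unique site locations by COMID;
# the lowercase "comid" column matches the StreamCatTools output shown earlier
model_data <- GreenBay_FoxRiver_sf_locs |>
  dplyr::left_join(ws_data, by = c("COMID" = "comid"))
names(model_data)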
Building Statistical Models
Spatial Dependence
- For spatial data, nearby observations tend to be more similar than distant observations
- This phenomenon is called spatial dependence and can be built into statistical models
- The benefits of incorporating spatial dependence are significant and include:
  - More realistic characterization of ecological drivers
  - More precise predictions at unobserved locations
The spmodel R package
- The spmodel R package makes spatial models accessible via straightforward extensions to common modeling functions like lm() and glm()
- Spatial dependence is based on Euclidean (straight-line) distance
- Learn more at https://usepa.github.io/spmodel/
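To make the interface concrete, here is a minimal sketch of fitting a spatial linear model with spmodel::splm(). The data frame below is a hypothetical stand-in; in practice you might join site-level TADA results (e.g., mean total phosphorus) with StreamCat covariates like Pct_Urbanized.
library(spmodel)

# Hypothetical site-level data: coordinates, a response, and a covariate
set.seed(1)
site_data <- data.frame(
  lon = runif(30, -88.1, -87.9),
  lat = runif(30, 44.4, 44.6),
  log_tp = rnorm(30), # e.g., log total phosphorus
  pct_urban = runif(30, 0, 100)
)

# Fit a spatial linear model with an exponential covariance based on
# Euclidean distance between sites; the syntax parallels lm()
mod <- splm(log_tp ~ pct_urban,
  data = site_data,
  spcov_type = "exponential",
  xcoord = lon,
  ycoord = lat
)

summary(mod) # fixed effects plus estimated spatial covariance parameters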
The SSN2 R package
- Like spmodel, SSN2 extends common modeling functions like lm() and glm()
- Spatial dependence is based on stream network distance (flow-connected, flow-unconnected)
- SSN2 is an updated version of SSN (SSN has been archived)
- Learn more at https://usepa.github.io/SSN2/
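A minimal sketch using the MiddleFork04 example dataset that ships with SSN2, following the package’s documented workflow (see the SSN2 site for details):
library(SSN2)

# Copy the MiddleFork04.ssn example to a temp directory, import it,
# and build the stream network distance matrices
copy_lsn_to_temp()
mf04p <- ssn_import(paste0(tempdir(), "/MiddleFork04.ssn"),
  predpts = "pred1km",
  overwrite = TRUE
)
ssn_create_distmat(mf04p)

# Fit a stream network spatial model: the tailup and taildown components
# use flow-connected and flow-unconnected network distance
ssn_mod <- ssn_lm(Summer_mn ~ ELEV_DEM,
  ssn.object = mf04p,
  tailup_type = "exponential",
  taildown_type = "spherical",
  euclid_type = "gaussian",
  additive = "afvArea"
)

summary(ssn_mod)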
TADA Module 2: Geospatial Functions
Additional functions and a more detailed example workflow are available here: https://usepa.github.io/EPATADA/articles/TADAModule2.html
CWA Assessment Process
We do not have time to cover the full process today. Let’s focus on geospatial aspects!
See EPA’s Integrated Reporting Memoranda under CWA Sections 303(d), 305(b), and 314.
What are Assessment Units?
Geospatial areas for analysis. Let’s assign data to those units!
CWA assessment determinations are made by assessment unit, meaning the entire assessment unit is assessed as either meeting or not meeting water quality standards (i.e., thresholds or criteria) for all designated uses.
How are assessment units delineated?
Assessment units are typically delineated as watershed-oriented collections of stream reaches, often broken at physical features like waterfalls, bridge crossings, or changes in land use. Delineating them this way supports analyzing water quality impairments within a specific area and helps ensure data homogeneity and spatial clarity within each assessment unit.
- Existing Assessment Units are available from ATTAINS geospatial services
Associating ATTAINS Assessment Units with WQP Monitoring Locations
One of the first steps in the CWA assessment process is to define Assessment Units and associate data with them. A major source for water quality data is the WQP.
- Assessment Units: state or tribal waterbody geospatial features
  - These may be lines, areas, or points
- Water Quality Portal Monitoring Locations
  - These are points
TADA_GetATTAINS() Part A
- Automates matching of WQP monitoring locations with ATTAINS assessment units that fall within (intersect) the same NHDPlus catchment (details)
- The function uses high res NHDPlus catchments by default because 80% of state submitted assessment units in ATTAINS were developed based on high res NHD; users can select medium res if applicable to their use case
WQP_with_ATTAINSonly <- TADA_GetATTAINS(GreenBay_FoxRiver_Subset, fill_catchments = FALSE, return_sf = TRUE)
TADA_ViewATTAINS(WQP_with_ATTAINSonly)
Challenges with Automated Approach
- Certain NHDPlus high res catchments overlap multiple ATTAINS assessment units (state submitted hydrography), which means sites are assigned to both AUs in the current functions.
- WQP sites are not always accurate (imprecise coordinates).
- WQP location metadata may also be helpful for matching/QAQC’ing waterbody names against ATTAINS waterbody names, rather than relying solely on lat/long and geospatial/mapping information.
- Users must manually review associations for accuracy.
TADA_GetATTAINS() Part B
See: nhdplusTools, TADA::fetchNHD(), TADA_GetATTAINS(), TADA_ViewATTAINS()
Creating new AUs to assess additional areas (leveraging USGS’s nhdplusTools and TADA geospatial functions)
TADA includes a way to explore this via the fill_catchments argument of TADA_GetATTAINS(). This is included for exploratory purposes only. In theory, states and tribal nations could use the high res catchments as new assessment unit polygons to assess additional areas where there is WQP data but no Assessment Unit yet in ATTAINS, but that process is outside of TADA.
For WQP monitoring sites that DO NOT overlap an existing ATTAINS feature (neither the ATTAINS NHDPlus high res catchment snapshot nor state submitted points/lines/polygons), there is nothing to use from ATTAINS: these are areas where there is WQP data but no ATTAINS Assessment Unit yet.
For these sites, we implemented a solution using nhdplusTools to pull in either NHDPlus high res or medium res catchments (the user can choose, but high res is the default), match those with the WQP sites, and create new IDs, essentially creating new AUs from the catchments that intersect these WQP sites.
WQP_withATTAINSandNHDPluscatchments <- TADA_GetATTAINS(GreenBay_FoxRiver_Subset, fill_catchments = TRUE, return_sf = TRUE)
TADA_ViewATTAINS(WQP_withATTAINSandNHDPluscatchments)
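Finally, let’s close out the timer we started at the beginning of the workflow:
# Record end time and report total elapsed workflow time
end.time <- Sys.time()
round(difftime(end.time, start.time, units = "mins"), 1)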
Contribute
Note: TADA is still under development. New functionality is added weekly, and sometimes we need to make bug fixes in response to tester and user feedback. We appreciate your feedback, patience, and interest in these helpful tools.
If you are interested in contributing to TADA development, more information is available at:
https://usepa.github.io/EPATADA/articles/CONTRIBUTING.html
We welcome collaboration with external partners.
Contribute to EPATADA in a way that helps elevate work you have already done, broadens the user base of the package, or improves the resource for all!
Thank you to our workshop contributors!
EPA: Cristina Mullin (mullin.cristina@epa.gov), Marc Weber, Hillary Marler, Kenny Wong, Michael Dumelle, Shelly Thawley
USGS: Dave Blodgett