Participatory (Volunteer) Scientists are Monitoring Waters and Sharing Data via EPA's Water Quality eXchange (WQX)
TADA Team
2025-10-20
Source:vignettes/Participatory-Science-Water-Projects-in-WQX.Rmd
Participatory-Science-Water-Projects-in-WQX.RmdFind volunteer data in WQX
Let’s explore participatory
science water projects using the Water Quality eXchange (WQX), Water
Quality Portal (WQP), and the EPATADA R Package. To start, let’s find
volunteer monitoring organizations who have submitted data to EPA’s
Water Quality eXchange (WQX) by reviewing the organization
domain table available here.
# Get the WQX organizations domain
organizations <- read.csv(url("https://cdx.epa.gov/wqx/download/DomainValues/Organization.CSV"))
# Subset to include only "Volunteer" organizations and exclude WQX test/training orgs
volunteer_orgs <- subset(
organizations,
Type == "Volunteer" &
!grepl("training",
Name,
ignore.case = TRUE
) &
!grepl("test",
Name,
ignore.case = TRUE
) &
!grepl("\\*",
Name,
ignore.case = TRUE
)
)
unique(volunteer_orgs$Name)Generate a list of 5 random organization IDs:
random_volunteer_orgIDs <- sample(volunteer_orgs$ID, size = 5)Prepare list for use in TADA_DataRetrieval:
unlist(random_volunteer_orgIDs)Query the Water Quality Portal (WQP) using
TADA_DataRetrieval and the 5 random volunteer organizations
IDs. We will look for any data available from 2015 to present.
volunteer_data <- TADA_DataRetrieval(
startDate = "2015-01-01",
organization = random_volunteer_orgIDs,
ask = FALSE,
applyautoclean = TRUE
)Retrieve data from WQP
Alternatively, choose 5 volunteer monitoring organizations and query WQP. We will move forward with this example of volunteer organizations in CT:
selected_orgs <-
c(
"CONNRIVERCONSERVANCY",
"CT_NERR",
"BANTAMLAKE_WQX",
"CTVOLMON",
"CT_NERR"
)
volunteer_data <- TADA_DataRetrieval(
# startDate = "2022-01-01",
organization = selected_orgs,
ask = FALSE,
applyautoclean = TRUE
)## Checking what data is available. This may take a moment.
## The number of sites and/or records matched by the query terms is large, so the download may take some time.
## [1] "Downloading data from sites with fewer than 350000 results by grouping them together."
## | | | 0% | |= | 2% | |====== | 9% | |============== | 20% | |=========================== | 38% | |======================================================================| 100%
## [1] "Data successfully downloaded. Running TADA_AutoClean function."
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "No deprecated characteristic names found in dataset."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"
Explore and refine results
Review volunteer monitoring projects in WQX:
unique(volunteer_data$ProjectName)## [1] "Riffle Bioassessment by Volunteers Program"
## [2] "Volunteer Stream Temperature Monitoring Network"
## [3] "CRC and Affiliate Monitoring"
## [4] "Connecticut Lake Watch"
## [5] "CRC 2019 Bacteria Monitoring"
## [6] "Chicopee Four Rivers Watershed Council 2019"
## [7] "Deerfield River Watershed Association 2019"
## [8] "Connecticut River Conservancy/Connecticut River Watershed Council 2012-2018"
## [9] "2019 Anguilla Brook Watershed Bacteria Source Trackdown"
## [10] "2012 Flat Brook Trackdown Survey"
## [11] "Connecticut River Conservancy 2020"
## [12] "Fort River Watershed Asssociation 2020"
## [13] "Deerfield River Watershed Association 2020"
## [14] "Chicopee 4River Watershed Council 2020"
## [15] "Pomperaug River Watershed Based Plan Implementation Groundwork: Additional Water Quality Monitoring, Agricultural Outreach, BMP Implementation Design and Landowner Agreements to Address Bacteria Impairments. EPA RFA No. 21059"
## [16] "Still River Watershed Pollution Trackdown Survey. CTDEEP Contract No. 17-06"
## [17] "Connecticut River Conservancy 2021"
## [18] "Chicopee 4 Rivers Watershed Council 2021"
## [19] "Deerfield River Watershed Association 2021"
## [20] "Connecticut River Conservancy 2022"
## [21] "Deerfield River Watershed Association 2022"
## [22] "Chicopee 4 Rivers Watershed Council 2022"
## [23] "Clean Up Sound & Harbors (CUSH) volunteer water monitoring program"
## [24] "CT Harbor Watch water monitoring program"
Generate pie chart:
TADA_FieldValuesPie(volunteer_data, field = "MonitoringLocationTypeName")
Review and remove sites if coordinates are imprecise or outside US:
volunteer_data <- TADA_FlagCoordinates(volunteer_data,
clean_outsideUSA = "remove",
clean_imprecise = TRUE,
flaggedonly = FALSE
)Use TADA_OverviewMap to generate a map:
TADA_OverviewMap(volunteer_data)Review and remove duplicate results if present:
volunteer_data <- TADA_FindPotentialDuplicatesSingleOrg(volunteer_data)## [1] "TADA_FindPotentialDuplicatesSingleOrg: 355 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
volunteer_data <- dplyr::filter(volunteer_data, TADA.SingleOrgDup.Flag == "Unique")Prepare censored (nondetects and overdetects) results for analysis:
volunteer_data <- TADA_SimpleCensoredMethods(
volunteer_data,
nd_method = "multiplier",
nd_multiplier = 0.5,
od_method = "as-is",
od_multiplier = "null"
)## [1] "TADA_IDCensoredData: 85 records in supplied dataset have conflicting detection condition and detection limit type information. These records will not be included in detection limit handling calculations."
Run key TADA quality control flagging functions and remove suspect results:
volunteer_data <- TADA_RunKeyFlagFunctions(
volunteer_data,
clean = TRUE
)## [1] "TADA_FindQCActivities: Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
Flag results above and below thresholds. Review carefully and consider removing.
volunteer_data <- TADA_FlagAboveThreshold(volunteer_data,
clean = FALSE,
flaggedonly = FALSE
)## TADA_FlagAboveThreshold: Returning the dataframe with flags. Counts: NA - Not Available: 2870, Pass: 34559, Suspect: 689
volunteer_data <- TADA_FlagBelowThreshold(volunteer_data,
clean = FALSE,
flaggedonly = FALSE
)## TADA_FlagBelowThreshold: No data below the WQX Lower Threshold was found in your dataframe. Returning the input dataframe with TADA.ResultValueBelowLowerThreshold.Flag column for tracking. Counts: NA - Not Available: 2870, Pass: 35248
Harmonize synonyms if found:
volunteer_data <- TADA_HarmonizeSynonyms(volunteer_data)Generate table:
TADA_FieldValuesTable(volunteer_data, field = "ActivityTypeCode")## Value Count
## 1 Sample-Routine 34565
## 2 Field Msr/Obs 3553
Generate pie chart:
TADA_FieldValuesPie(volunteer_data, field = "OrganizationFormalName")
Generate pie chart:
TADA_FieldValuesPie(volunteer_data, field = "MonitoringLocationName")
Remove non-numeric results:
volunteer_data <- TADA_ConvertSpecialChars(
volunteer_data,
col = "TADA.ResultMeasureValue",
clean = TRUE
)Review the number of sites and records for each characteristic:
TADA_SummarizeColumn(volunteer_data)## # A tibble: 22 × 3
## TADA.CharacteristicName$TADA.CharacteristicName n_sites n_records
## <chr> <int> <int>
## 1 AMMONIA 15 244
## 2 CONDUCTANCE 15 149
## 3 COUNT 808 23121
## 4 DEPTH, SECCHI DISK DEPTH 41 474
## 5 DISSOLVED OXYGEN (DO) 15 589
## 6 ENTEROCOCCUS 15 117
## 7 ESCHERICHIA COLI 411 8641
## 8 FECAL COLIFORM 16 129
## 9 INORGANIC NITROGEN (NO2, NO3, & NH3) 15 269
## 10 NITRATE 30 303
## # ℹ 12 more rows
Filter data to review a single characteristic:
ecoli <- dplyr::filter(
volunteer_data,
TADA.ComparableDataIdentifier %in% c(
"ESCHERICHIA COLI_NA_NA_CFU/100ML"
)
)Generate scatter plot for E. coli:
TADA_GroupedScatterplot(ecoli)## [1] "TADA_GroupedScatterplot: No 'groups' selected for MonitoringLocationName. There are 236 MonitoringLocationNames in the TADA dataframe. The top four MonitoringLocationNames by number of results will be plotted: Sunderland Boat Ramp; CT River at Barton Cove Boat Ramp (now MA-CTR_122.5); DCR/UMASS boat dock and Oxbow/Easthampton Boat Ramp."
Filter to a single site and continue exploring E. coli:
ecoli <- dplyr::filter(
ecoli,
TADA.MonitoringLocationIdentifier %in% c(
"CONNRIVERCONSERVANCY-OXE1"
)
)Let’s check if any results are above the EPA 304A recommended maximum criteria magnitude (see: 2012 Recreational Water Quality Criteria Fact Sheet).
If interested, you can find other state, tribal, and EPA 304A criteria in EPA’s Criteria Search Tool.
Let’s check if any individual results exceed 320 CFU/100mL (the magnitude component of the EPA recommendation 2 criteria for ESCHERICHIA COLI).
# add column with comparison to criteria mag (excursions)
ecoli <- ecoli %>%
dplyr::mutate(meets_criteria_mag = ifelse(TADA.ResultMeasureValue <= 320, "Yes", "No"))
# review subset
ecoli_subset_review <- ecoli %>%
dplyr::select(
MonitoringLocationIdentifier, OrganizationFormalName, ActivityStartDate, TADA.ResultMeasureValue,
meets_criteria_mag
)
TADA_TableExport(ecoli_subset_review)Generate stats table. Review percentiles. Less than 5% of results fall above ~19 CFU/100mL and over 98% of results fall below ~2185 CFU/100m
TADA_TableExport(TADA_Stats(ecoli))Generate a scatterplot. One result value is above the threshold.
TADA_Scatterplot(ecoli, id_cols = "TADA.ComparableDataIdentifier") %>%
plotly::add_lines(
y = 320,
x = c(min(ecoli$ActivityStartDate), max(ecoli$ActivityStartDate)),
inherit = FALSE,
showlegend = FALSE,
line = list(color = "red"),
hoverinfo = "none"
)Generate a histogram.
TADA_Histogram(ecoli, id_cols = "TADA.ComparableDataIdentifier")TADA_Boxplot can be useful for identifying skewness and
percentiles.
TADA_Boxplot(ecoli, id_cols = "TADA.ComparableDataIdentifier")Check out other example R workflows designed to work with WQP data under the Articles tab on the EPATADA package website.
