Tampa Bay, FL - Simple workflow
Standardize, clean and wrangle Water Quality Portal data in Tampa Bay, FL into more analytic-ready formats using the harmonize_wq package
US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help: - Identify differences in data units (including speciation and basis) - Identify differences in sampling or analytic methods - Resolve data errors using transparent assumptions - Reduce data to the columns that are most commonly needed - Transform data from long to wide format
Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
Simple workflow
This example steps through a typical workflow in Tampa Bay, FL to demonstrate commonly used functionality
Install the required libraries
[1]:
import sys
#!python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8
Create dataretrieval query using a polygon for Area Of Interest
[2]:
from harmonize_wq import wrangle
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
warnings.warn('GDAL not installed. Some functions will not work.')
[3]:
# Read geometry for Area of Interest from geojson file url
# NOTE: alternatively you can direct it to a local shapefile
aoi_url = r'https://github.com/USEPA/Coastal_Ecological_Indicators/raw/master/DGGS_Coastal/temperature_data/TampaBay.geojson'
[4]:
# Map aoi
aoi_gdf = wrangle.as_gdf(aoi_url)
# geoJSON should be WGS1984 standard, but this one isn't
aoi_gdf.to_crs(4326, inplace=True)
aoi_gdf.plot()
[4]:
<Axes: >
[5]:
# Build query
query = {'characteristicName': ['Phosphorus',
'Temperature, water',
'Depth, Secchi disk depth',
'Dissolved oxygen (DO)',
'Salinity',
'pH',
'Nitrogen',
'Conductivity',
'Organic carbon',
'Chlorophyll a',
'Turbidity',
'Sediment',
'Fecal Coliform',
'Escherichia coli']}
query['bBox'] = wrangle.get_bounding_box(aoi_gdf)
query['dataProfile'] = 'narrowResult'
Retrieve data
Stations will be retrieved using the query criteria seperately after results. md_narrow is metadata documenting the query not used here but meant for reproducibility
[6]:
import dataretrieval.wqp as wqp
[7]:
# Query for results
# Note: large quieries like this can take up a lot of RAM and may give a DtypeWarning,
# set low_memory=False or provide dataTypes for columns to use less memory.
res_narrow, md_narrow = wqp.get_results(**query)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (9,10,13,15,17,19,22,23,28,31,32,33,36,38,58,60,61,63,64,65,70,71,73) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(StringIO(response.text), delimiter=',')
[8]:
# Look at initial results (input)
df = res_narrow
df
[8]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | AnalysisEndTime/TimeZoneCode | ResultLaboratoryCommentCode | ResultLaboratoryCommentText | ResultDetectionQuantitationLimitUrl | LaboratoryAccreditationIndicator | LaboratoryAccreditationAuthorityName | TaxonomistAccreditationIndicator | TaxonomistAccreditationAuthorityName | LabSamplePreparationUrl | ProviderName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130612585-W | 2013-06-12 | 11:01:00 | EST | 21FLHILL_WQX-585 | STORET-301235413 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
1 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-481901119134 | 2013-11-19 | 14:01:00 | EST | 21FLSEAS_WQX-48SEAS190 | STORET-310535134 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
2 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130702047-M | 2013-07-02 | 11:01:00 | EST | 21FLHILL_WQX-047 | STORET-300620295 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
3 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130716021 | 2013-07-16 | 11:01:00 | EST | 21FLHILL_WQX-021 | STORET-300666279 | NaN | NaN | ... | NaN | NaN | NaN | https://www.waterqualitydata.us/data/providers... | NaN | NaN | NaN | NaN | NaN | STORET |
4 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-131216112-M | 2013-12-16 | 12:01:00 | EST | 21FLHILL_WQX-112 | STORET-301229196 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465643 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500762 | 1955-04-08 | NaN | NaN | USGS-273926082304501 | NWIS-6842541 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1465644 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500762 | 1955-04-08 | NaN | NaN | USGS-273926082304501 | NWIS-6842544 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1465645 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95800924 | 1957-10-21 | 14:05:00 | EST | USGS-02306001 | NWIS-6894410 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1465646 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500782 | 1955-04-08 | NaN | NaN | USGS-274455082253601 | NWIS-6842941 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1465647 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500782 | 1955-04-08 | NaN | NaN | USGS-274455082253601 | NWIS-6842944 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1465648 rows × 78 columns
Harmonize data using defaults
[9]:
from harmonize_wq import harmonize
from harmonize_wq import location
from harmonize_wq import visualize
[10]:
# Harmonize it and look at how it changed (output)
# Note: 'ignore' unit dimensionality errors will warn when they are encountered and replace with NaN
# Note: depending on the Pint version this may cause a UnitStrippedWarning
df_harmonized = harmonize.harmonize_all(df, errors='ignore')
df_harmonized
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'None' UNDEFINED UNIT for Secchi
warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"
[10]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | TOTAL NITROGEN_ MIXED FORMS | Carbon | Phosphorus | TP_Phosphorus | TDP_Phosphorus | Other_Phosphorus | Salinity | Temperature | Turbidity | pH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130612585-W | 2013-06-12 | 11:01:00 | EST | 21FLHILL_WQX-585 | STORET-301235413 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-481901119134 | 2013-11-19 | 14:01:00 | EST | 21FLSEAS_WQX-48SEAS190 | STORET-310535134 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.0 Practical_Salinity_Units | NaN | NaN | NaN |
2 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130702047-M | 2013-07-02 | 11:01:00 | EST | 21FLHILL_WQX-047 | STORET-300620295 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 28.19 degree_Celsius | NaN | NaN |
3 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130716021 | 2013-07-16 | 11:01:00 | EST | 21FLHILL_WQX-021 | STORET-300666279 | NaN | NaN | ... | NaN | 3.8 milligram / liter | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-131216112-M | 2013-12-16 | 12:01:00 | EST | 21FLHILL_WQX-112 | STORET-301229196 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.48 dimensionless |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465643 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500762 | 1955-04-08 | NaN | NaN | USGS-273926082304501 | NWIS-6842541 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 25.7 degree_Celsius | NaN | NaN |
1465644 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500762 | 1955-04-08 | NaN | NaN | USGS-273926082304501 | NWIS-6842544 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.8 dimensionless |
1465645 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95800924 | 1957-10-21 | 14:05:00 | EST | USGS-02306001 | NWIS-6894410 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.7 dimensionless |
1465646 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500782 | 1955-04-08 | NaN | NaN | USGS-274455082253601 | NWIS-6842941 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 26.1 degree_Celsius | NaN | NaN |
1465647 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500782 | 1955-04-08 | NaN | NaN | USGS-274455082253601 | NWIS-6842944 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.6 dimensionless |
1465648 rows × 97 columns
[11]:
# Get harmonized stations clipped to the Area of Interest
stations_gdf, stations, site_md = location.get_harmonized_stations(query, aoi=aoi_gdf)
[12]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(df_harmonized, stations_gdf)
legend_kwds={"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[12]:
<Axes: >
[13]:
# Map average temperature results at each station
gdf_temperature = visualize.map_measure(df_harmonized, stations_gdf, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
[13]:
<Axes: >
Clean additional columns of data
[14]:
from harmonize_wq import clean
[15]:
df_cleaned = clean.datetime(df_harmonized) # datetime
df_cleaned = clean.harmonize_depth(df_cleaned) # Sample depth
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 100662 incomplete dates found, consider setting datetime_index to False.
warnings.warn(
Reduce data to the columns that are most commonly needed
There are many columns in the dataframe that are characteristic specific, that is they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation of the data these columns must either be split, generating a new column for each characteristic with values, or moved out from the table if not being used.
[16]:
# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df_expanded = wrangle.split_col(df_cleaned)
df_expanded
[16]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | QA_Other_Phosphorus | QA_pH | QA_E_coli | QA_Conductivity | QA_DO | QA_Temperature | QA_Nitrogen | QA_Chlorophyll | QA_Salinity | QA_Turbidity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130612585-W | 2013-06-12 | 11:01:00 | -0500 | 21FLHILL_WQX-585 | STORET-301235413 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-481901119134 | 2013-11-19 | 14:01:00 | -0500 | 21FLSEAS_WQX-48SEAS190 | STORET-310535134 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130702047-M | 2013-07-02 | 11:01:00 | -0500 | 21FLHILL_WQX-047 | STORET-300620295 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-130716021 | 2013-07-16 | 11:01:00 | -0500 | 21FLHILL_WQX-021 | STORET-300666279 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 21FLHILL_WQX | Environmental Protection Commission of Hillsbo... | 21FLHILL_WQX-131216112-M | 2013-12-16 | 12:01:00 | -0500 | 21FLHILL_WQX-112 | STORET-301229196 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465643 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500762 | 1955-04-08 | NaN | NaN | USGS-273926082304501 | NWIS-6842541 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1465644 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500762 | 1955-04-08 | NaN | NaN | USGS-273926082304501 | NWIS-6842544 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1465645 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95800924 | 1957-10-21 | 14:05:00 | -0500 | USGS-02306001 | NWIS-6894410 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1465646 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500782 | 1955-04-08 | NaN | NaN | USGS-274455082253601 | NWIS-6842941 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1465647 | USGS-FL | USGS Florida Water Science Center | nwisfl.01.95500782 | 1955-04-08 | NaN | NaN | USGS-274455082253601 | NWIS-6842944 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1396992 rows × 113 columns
[17]:
# Split table into main with columns of interest and characteristic specific columns/metadata
main_df, chars_df = wrangle.split_table(df_expanded)
[18]:
# Drop empty columns (QA columns without flags)
main_df_small = main_df.dropna(axis=1, how='all')
Transform data from long to wide format
[19]:
transformed_df = wrangle.collapse_results(main_df_small)
Results are collapsed by retaining the first result that isn’t NAN. There can be several reasons for multiple results for the same parameter/characteristic sampled at the same station, time and by the same organization. The collapse_results function assumes the user has already reviewed the quality of all results and narrowed down instances of multiple results to only the desired/best/highest quality result before running this function.