Pensacola Bay FL - Simple workflow

Standardize, clean and wrangle Water Quality Portal data in Pensacola and Perdido Bays into more analytic-ready formats using the harmonize_wq package

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:

  • Identify differences in data units (including speciation and basis)

  • Identify differences in sampling or analytic methods

  • Resolve data errors using transparent assumptions

  • Reduce data to the columns that are most commonly needed

  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

Simple workflow (Temperature and Secchi Disk Depth only)

This example steps through a typical workflow in Pensacola and Perdido Bays to demonstrate commonly used functionality

Install the required libraries

[1]:
import sys
#!python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8

dataretrieval Query for a geojson

[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
  warnings.warn('GDAL not installed. Some functions will not work.')
[3]:
# Read geometry for Area of Interest from geojson file url
# NOTE: alternatively you can direct it to a local shapefile
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
[4]:
# Map aoi (geojson is WGS1984 standard)
wrangle.as_gdf(aoi_url).plot()
[4]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Simple_10_1.png
[5]:
# Build query
query = {'characteristicName': ['Temperature, water',
                                'Depth, Secchi disk depth',
                                ]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'
[6]:
# Run query
res_narrow, md_narrow = wqp.get_results(**query)

# dataframe of downloaded results
res_narrow
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (9,13,17,23,31,61,62,71) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(StringIO(response.text), delimiter=',')
[6]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... AnalysisEndTime/TimeZoneCode ResultLaboratoryCommentCode ResultLaboratoryCommentText ResultDetectionQuantitationLimitUrl LaboratoryAccreditationIndicator LaboratoryAccreditationAuthorityName TaxonomistAccreditationIndicator TaxonomistAccreditationAuthorityName LabSamplePreparationUrl ProviderName
0 21FLCBA_WQX CHOCTAWHATCHEE BASIN ALLIANCE 21FLCBA_WQX-BAS219848-162813 2013-09-23 17:15:00 CST 21FLCBA_WQX-BAS02 STORET-760593202 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
1 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1563531F1 2013-12-16 10:01:00 EST 21FLPNS_WQX-33020J10 STORET-308098430 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
2 21FLSEAS_WQX Florida Department of Environmental Protection 21FLSEAS_WQX-028100514133 2013-05-14 13:01:00 EST 21FLSEAS_WQX-02SEAS810 STORET-310465573 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
3 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1485424F1 2013-02-20 12:01:00 EST 21FLPNS_WQX-33010G10 STORET-308175318 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
4 AWW_WQX Alabama Water Watch AWW_WQX-aww_2174:20130928093000:SR:SYR 2013-09-28 09:30:00 CST AWW_WQX-aww_2174 STORET-1003650188 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
97984 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-G4NW0022-02/21/24-FP1 2024-02-21 13:55:00 EST 21FLPNS_WQX-G4NW0022 STORET-1042728107 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
97985 21FLESC_WQX ESCAMBIA COUNTY 21FLESC_WQX-GRANDEBAYOU50-2/13/2024-F 2024-02-13 10:35:00 CST 21FLESC_WQX-GRANDEBAYOU50 STORET-1044550197 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
97986 21FLESC_WQX ESCAMBIA COUNTY 21FLESC_WQX-CARPENTERCR40-1/22/2024-F 2024-01-22 15:05:00 CST 21FLESC_WQX-CARPENTERCR40 STORET-1044573022 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
97987 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-335647_1872428_173 2024-03-06 09:20:00 CDT 21AWIC-7290 STORET-1039485360 1872428.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
97988 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-335642_1872423_173 2024-03-06 07:45:00 CDT 21AWIC-1208 STORET-1039486410 1872423.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET

97989 rows × 78 columns

Harmonize and clean all results

[7]:
from harmonize_wq import harmonize
from harmonize_wq import clean
[8]:
# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
[8]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... ResultDetectionQuantitationLimitUrl LaboratoryAccreditationIndicator LaboratoryAccreditationAuthorityName TaxonomistAccreditationIndicator TaxonomistAccreditationAuthorityName LabSamplePreparationUrl ProviderName QA_flag Secchi Temperature
0 21FLCBA_WQX CHOCTAWHATCHEE BASIN ALLIANCE 21FLCBA_WQX-BAS219848-162813 2013-09-23 17:15:00 CST 21FLCBA_WQX-BAS02 STORET-760593202 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 26.0555555555556 degree_Celsius
1 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1563531F1 2013-12-16 10:01:00 EST 21FLPNS_WQX-33020J10 STORET-308098430 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 12.35 degree_Celsius
2 21FLSEAS_WQX Florida Department of Environmental Protection 21FLSEAS_WQX-028100514133 2013-05-14 13:01:00 EST 21FLSEAS_WQX-02SEAS810 STORET-310465573 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 23.0 degree_Celsius
3 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1485424F1 2013-02-20 12:01:00 EST 21FLPNS_WQX-33010G10 STORET-308175318 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 11.23 degree_Celsius
4 AWW_WQX Alabama Water Watch AWW_WQX-aww_2174:20130928093000:SR:SYR 2013-09-28 09:30:00 CST AWW_WQX-aww_2174 STORET-1003650188 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 27.0 degree_Celsius
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
97984 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-G4NW0022-02/21/24-FP1 2024-02-21 13:55:00 EST 21FLPNS_WQX-G4NW0022 STORET-1042728107 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN 0.58 meter NaN
97985 21FLESC_WQX ESCAMBIA COUNTY 21FLESC_WQX-GRANDEBAYOU50-2/13/2024-F 2024-02-13 10:35:00 CST 21FLESC_WQX-GRANDEBAYOU50 STORET-1044550197 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN 2.2 meter NaN
97986 21FLESC_WQX ESCAMBIA COUNTY 21FLESC_WQX-CARPENTERCR40-1/22/2024-F 2024-01-22 15:05:00 CST 21FLESC_WQX-CARPENTERCR40 STORET-1044573022 NaN NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 9.07 degree_Celsius
97987 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-335647_1872428_173 2024-03-06 09:20:00 CDT 21AWIC-7290 STORET-1039485360 1872428.0 NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 18.8128 degree_Celsius
97988 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-335642_1872423_173 2024-03-06 07:45:00 CDT 21AWIC-1208 STORET-1039486410 1872423.0 NaN ... NaN NaN NaN NaN NaN NaN STORET NaN NaN 19.1005 degree_Celsius

97989 rows × 81 columns

[9]:
# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized)  # datetime
df_cleaned = clean.harmonize_depth(df_cleaned)  # Sample depth
df_cleaned
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 9057 incomplete dates found, consider setting datetime_index to False.
  warnings.warn(
[9]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... LaboratoryAccreditationAuthorityName TaxonomistAccreditationIndicator TaxonomistAccreditationAuthorityName LabSamplePreparationUrl ProviderName QA_flag Secchi Temperature Activity_datetime Depth
0 21FLCBA_WQX CHOCTAWHATCHEE BASIN ALLIANCE 21FLCBA_WQX-BAS219848-162813 2013-09-23 17:15:00 -0600 21FLCBA_WQX-BAS02 STORET-760593202 NaN NaN ... NaN NaN NaN NaN STORET NaN NaN 26.0555555555556 degree_Celsius 2013-09-23 23:15:00+00:00 NaN
1 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1563531F1 2013-12-16 10:01:00 -0500 21FLPNS_WQX-33020J10 STORET-308098430 NaN NaN ... NaN NaN NaN NaN STORET NaN NaN 12.35 degree_Celsius 2013-12-16 15:01:00+00:00 NaN
2 21FLSEAS_WQX Florida Department of Environmental Protection 21FLSEAS_WQX-028100514133 2013-05-14 13:01:00 -0500 21FLSEAS_WQX-02SEAS810 STORET-310465573 NaN NaN ... NaN NaN NaN NaN STORET NaN NaN 23.0 degree_Celsius 2013-05-14 18:01:00+00:00 NaN
3 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1485424F1 2013-02-20 12:01:00 -0500 21FLPNS_WQX-33010G10 STORET-308175318 NaN NaN ... NaN NaN NaN NaN STORET NaN NaN 11.23 degree_Celsius 2013-02-20 17:01:00+00:00 NaN
4 AWW_WQX Alabama Water Watch AWW_WQX-aww_2174:20130928093000:SR:SYR 2013-09-28 09:30:00 -0600 AWW_WQX-aww_2174 STORET-1003650188 NaN NaN ... NaN NaN NaN NaN STORET NaN NaN 27.0 degree_Celsius 2013-09-28 15:30:00+00:00 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
97984 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-G4NW0022-02/21/24-FP1 2024-02-21 13:55:00 -0500 21FLPNS_WQX-G4NW0022 STORET-1042728107 NaN NaN ... NaN NaN NaN NaN STORET NaN 0.58 meter NaN 2024-02-21 18:55:00+00:00 NaN
97985 21FLESC_WQX ESCAMBIA COUNTY 21FLESC_WQX-GRANDEBAYOU50-2/13/2024-F 2024-02-13 10:35:00 -0600 21FLESC_WQX-GRANDEBAYOU50 STORET-1044550197 NaN NaN ... NaN NaN NaN NaN STORET NaN 2.2 meter NaN 2024-02-13 16:35:00+00:00 NaN
97986 21FLESC_WQX ESCAMBIA COUNTY 21FLESC_WQX-CARPENTERCR40-1/22/2024-F 2024-01-22 15:05:00 -0600 21FLESC_WQX-CARPENTERCR40 STORET-1044573022 NaN NaN ... NaN NaN NaN NaN STORET NaN NaN 9.07 degree_Celsius 2024-01-22 21:05:00+00:00 NaN
97987 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-335647_1872428_173 2024-03-06 09:20:00 -0500 21AWIC-7290 STORET-1039485360 1872428.0 NaN ... NaN NaN NaN NaN STORET NaN NaN 18.8128 degree_Celsius 2024-03-06 14:20:00+00:00 NaN
97988 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-335642_1872423_173 2024-03-06 07:45:00 -0500 21AWIC-1208 STORET-1039486410 1872423.0 NaN ... NaN NaN NaN NaN STORET NaN NaN 19.1005 degree_Celsius 2024-03-06 12:45:00+00:00 NaN

97989 rows × 83 columns

Transform results from long to wide format

There are many columns in the dataframe that are characteristic specific, that is they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation of the data these columns must either be split, generating a new column for each characteristic with values, or moved out from the table if not being used.

[10]:
from harmonize_wq import wrangle

# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)

# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)

# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)

# Reduced columns
df_wide.columns
[10]:
Index(['OrganizationFormalName', 'ProviderName', 'Secchi', 'Temperature',
       'Depth', 'QA_Secchi', 'QA_Temperature'],
      dtype='object')

Results are collapsed by retaining the first result that isn’t NAN. There can be several reasons for multiple results for the same parameter/characteristic sampled at the same station, time and by the same organization. The collapse_results function assumes the user has already reviewed the quality of all results and narrowed down instances of multiple results to only the desired/best/highest quality result before running this function.

Map results

[11]:
from harmonize_wq import location
from harmonize_wq import visualize

# Get harmonized stations clipped to the Area of Interest
stations_gdf, stations, site_md = location.get_harmonized_stations(query, aoi=aoi_url)
[12]:
# Map average temperature results at each station
gdf_temperature = visualize.map_measure(df_wide, stations_gdf, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
[12]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Simple_23_1.png