Pensacola Bay FL - Simple workflow
Standardize, clean and wrangle Water Quality Portal data in Pensacola and Perdido Bays into more analytic-ready formats using the harmonize_wq package
US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:
Identify differences in data units (including speciation and basis)
Identify differences in sampling or analytic methods
Resolve data errors using transparent assumptions
Reduce data to the columns that are most commonly needed
Transform data from long to wide format
Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
Simple workflow (Temperature and Secchi Disk Depth only)
This example steps through a typical workflow in Pensacola and Perdido Bays to demonstrate commonly used functionality
Install the required libraries
[1]:
import sys
#!python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8
dataretrieval Query for a geojson
[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
[3]:
# Read geometry for Area of Interest from geojson file url
# NOTE: alternatively you can direct it to a local shapefile
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
[4]:
# Map aoi (geojson is WGS1984 standard)
wrangle.as_gdf(aoi_url).plot()
[4]:
<Axes: >

[5]:
# Build query
query = {'characteristicName': ['Temperature, water',
'Depth, Secchi disk depth',
]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'
[6]:
# Run query
res_narrow, md_narrow = wqp.get_results(**query)
# dataframe of downloaded results
res_narrow
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/dataretrieval/wqp.py:153: DtypeWarning: Columns (9,13,17,23,28,31,61,62,71) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(StringIO(response.text), delimiter=",")
[6]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | AnalysisEndTime/TimeZoneCode | ResultLaboratoryCommentCode | ResultLaboratoryCommentText | ResultDetectionQuantitationLimitUrl | LaboratoryAccreditationIndicator | LaboratoryAccreditationAuthorityName | TaxonomistAccreditationIndicator | TaxonomistAccreditationAuthorityName | LabSamplePreparationUrl | ProviderName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AWW_WQX | Alabama Water Watch | AWW_WQX-aww_0330:20130112134500:SR:WSO | 2013-01-12 | 13:45:00 | CST | AWW_WQX-aww_0330 | STORET-1079461086 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
1 | 21FLCBA_WQX | CHOCTAWHATCHEE BASIN ALLIANCE | 21FLCBA_WQX-BAS219848-162813 | 2013-09-23 | 17:15:00 | CST | 21FLCBA_WQX-BAS02 | STORET-760593202 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
2 | 21FLPNS_WQX | FL Dept. of Environmental Protection, Northwes... | 21FLPNS_WQX-1563531F1 | 2013-12-16 | 10:01:00 | EST | 21FLPNS_WQX-33020J10 | STORET-308098430 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
3 | AWW_WQX | Alabama Water Watch | AWW_WQX-aww_0318:20131027140000:SR:WSO | 2013-10-27 | 14:00:00 | CST | AWW_WQX-aww_0318 | STORET-1079478859 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
4 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-028100514133 | 2013-05-14 | 13:01:00 | EST | 21FLSEAS_WQX-02SEAS810 | STORET-310465573 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
109770 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.99900500 | 1999-03-02 | 14:20:00 | CST | USGS-02376115 | NWIS-104002666 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
109771 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00201479 | 2001-11-28 | 12:05:00 | CST | USGS-02377570 | NWIS-53918846 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
109772 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00202076 | 2001-10-03 | 16:40:00 | CDT | USGS-02376115 | NWIS-104000948 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
109773 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00202072 | 2001-11-28 | 13:45:00 | CST | USGS-02376115 | NWIS-104000936 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
109774 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00201474 | 2001-10-03 | 14:15:00 | CDT | USGS-02377570 | NWIS-53918826 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
109775 rows × 78 columns
Harmonize and clean all results
[7]:
from harmonize_wq import harmonize
from harmonize_wq import clean
[8]:
# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'ResultMeasureValue: "Not Reported" result cannot be used' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(2.0, 'meter')> <Quantity(0.94, 'meter')>
<Quantity(0.6, 'meter')> ... <Quantity(1.14, 'meter')>
<Quantity(0.61, 'meter')> <Quantity(0.3, 'meter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(26.0555556, 'degree_Celsius')>
<Quantity(12.35, 'degree_Celsius')> <Quantity(23.0, 'degree_Celsius')>
... <Quantity(25.0, 'degree_Celsius')> <Quantity(24.0, 'degree_Celsius')>
<Quantity(20.5, 'degree_Celsius')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
[8]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | ResultDetectionQuantitationLimitUrl | LaboratoryAccreditationIndicator | LaboratoryAccreditationAuthorityName | TaxonomistAccreditationIndicator | TaxonomistAccreditationAuthorityName | LabSamplePreparationUrl | ProviderName | QA_flag | Secchi | Temperature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AWW_WQX | Alabama Water Watch | AWW_WQX-aww_0330:20130112134500:SR:WSO | 2013-01-12 | 13:45:00 | CST | AWW_WQX-aww_0330 | STORET-1079461086 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | STORET | NaN | 2.0 meter | NaN |
1 | 21FLCBA_WQX | CHOCTAWHATCHEE BASIN ALLIANCE | 21FLCBA_WQX-BAS219848-162813 | 2013-09-23 | 17:15:00 | CST | 21FLCBA_WQX-BAS02 | STORET-760593202 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 26.0555555555556 degree_Celsius |
2 | 21FLPNS_WQX | FL Dept. of Environmental Protection, Northwes... | 21FLPNS_WQX-1563531F1 | 2013-12-16 | 10:01:00 | EST | 21FLPNS_WQX-33020J10 | STORET-308098430 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 12.35 degree_Celsius |
3 | AWW_WQX | Alabama Water Watch | AWW_WQX-aww_0318:20131027140000:SR:WSO | 2013-10-27 | 14:00:00 | CST | AWW_WQX-aww_0318 | STORET-1079478859 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 23.0 degree_Celsius |
4 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-028100514133 | 2013-05-14 | 13:01:00 | EST | 21FLSEAS_WQX-02SEAS810 | STORET-310465573 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 23.0 degree_Celsius |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
109770 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.99900500 | 1999-03-02 | 14:20:00 | CST | USGS-02376115 | NWIS-104002666 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 23.0 degree_Celsius |
109771 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00201479 | 2001-11-28 | 12:05:00 | CST | USGS-02377570 | NWIS-53918846 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 20.0 degree_Celsius |
109772 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00202076 | 2001-10-03 | 16:40:00 | CDT | USGS-02376115 | NWIS-104000948 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 25.0 degree_Celsius |
109773 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00202072 | 2001-11-28 | 13:45:00 | CST | USGS-02376115 | NWIS-104000936 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 24.0 degree_Celsius |
109774 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00201474 | 2001-10-03 | 14:15:00 | CDT | USGS-02377570 | NWIS-53918826 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 20.5 degree_Celsius |
109775 rows × 81 columns
[9]:
# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized) # datetime
df_cleaned = clean.harmonize_depth(df_cleaned) # Sample depth
df_cleaned
[9]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | LaboratoryAccreditationAuthorityName | TaxonomistAccreditationIndicator | TaxonomistAccreditationAuthorityName | LabSamplePreparationUrl | ProviderName | QA_flag | Secchi | Temperature | Activity_datetime | Depth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AWW_WQX | Alabama Water Watch | AWW_WQX-aww_0330:20130112134500:SR:WSO | 2013-01-12 | 13:45:00 | CST | AWW_WQX-aww_0330 | STORET-1079461086 | NaN | NaN | ... | NaN | NaN | NaN | NaN | STORET | NaN | 2.0 meter | NaN | 2013-01-12 19:45:00+00:00 | NaN |
1 | 21FLCBA_WQX | CHOCTAWHATCHEE BASIN ALLIANCE | 21FLCBA_WQX-BAS219848-162813 | 2013-09-23 | 17:15:00 | CST | 21FLCBA_WQX-BAS02 | STORET-760593202 | NaN | NaN | ... | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 26.0555555555556 degree_Celsius | 2013-09-23 23:15:00+00:00 | NaN |
2 | 21FLPNS_WQX | FL Dept. of Environmental Protection, Northwes... | 21FLPNS_WQX-1563531F1 | 2013-12-16 | 10:01:00 | EST | 21FLPNS_WQX-33020J10 | STORET-308098430 | NaN | NaN | ... | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 12.35 degree_Celsius | 2013-12-16 15:01:00+00:00 | NaN |
3 | AWW_WQX | Alabama Water Watch | AWW_WQX-aww_0318:20131027140000:SR:WSO | 2013-10-27 | 14:00:00 | CST | AWW_WQX-aww_0318 | STORET-1079478859 | NaN | NaN | ... | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 23.0 degree_Celsius | 2013-10-27 20:00:00+00:00 | NaN |
4 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-028100514133 | 2013-05-14 | 13:01:00 | EST | 21FLSEAS_WQX-02SEAS810 | STORET-310465573 | NaN | NaN | ... | NaN | NaN | NaN | NaN | STORET | NaN | NaN | 23.0 degree_Celsius | 2013-05-14 18:01:00+00:00 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
109770 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.99900500 | 1999-03-02 | 14:20:00 | CST | USGS-02376115 | NWIS-104002666 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 23.0 degree_Celsius | 1999-03-02 20:20:00+00:00 | NaN |
109771 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00201479 | 2001-11-28 | 12:05:00 | CST | USGS-02377570 | NWIS-53918846 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 20.0 degree_Celsius | 2001-11-28 18:05:00+00:00 | NaN |
109772 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00202076 | 2001-10-03 | 16:40:00 | CDT | USGS-02376115 | NWIS-104000948 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 25.0 degree_Celsius | 2001-10-03 21:40:00+00:00 | NaN |
109773 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00202072 | 2001-11-28 | 13:45:00 | CST | USGS-02376115 | NWIS-104000936 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 24.0 degree_Celsius | 2001-11-28 19:45:00+00:00 | NaN |
109774 | USGS-AL | USGS Alabama Water Science Center | nwisal.01.00201474 | 2001-10-03 | 14:15:00 | CDT | USGS-02377570 | NWIS-53918826 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NWIS | NaN | NaN | 20.5 degree_Celsius | 2001-10-03 19:15:00+00:00 | NaN |
109775 rows × 83 columns
Transform results from long to wide format
There are many columns in the dataframe that are characteristic specific, that is they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation of the data these columns must either be split, generating a new column for each characteristic with values, or moved out from the table if not being used.
[10]:
from harmonize_wq import wrangle
# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)
# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)
# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)
# Reduced columns
df_wide.columns
[10]:
Index(['OrganizationFormalName', 'ProviderName', 'Secchi', 'Temperature',
'Depth', 'QA_Temperature', 'QA_Secchi'],
dtype='object')
Results are collapsed by retaining the first result that isn’t NAN. There can be several reasons for multiple results for the same parameter/characteristic sampled at the same station, time and by the same organization. The collapse_results function assumes the user has already reviewed the quality of all results and narrowed down instances of multiple results to only the desired/best/highest quality result before running this function.
Map results
[11]:
from harmonize_wq import location
from harmonize_wq import visualize
# Get harmonized stations clipped to the Area of Interest
stations_gdf, stations, site_md = location.get_harmonized_stations(query, aoi=aoi_url)
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:356: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
cond_notna = mask & (df_out["QA_flag"].notna()) # Mask cond and not NA
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'LatitudeMeasure: Imprecise: lessthan3decimaldigits' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:356: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
cond_notna = mask & (df_out["QA_flag"].notna()) # Mask cond and not NA
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
[12]:
# Map average temperature results at each station
gdf_temperature = visualize.map_measure(df_wide, stations_gdf, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
[12]:
<Axes: >
