Cape Cod - Detailed step-by-step

Standardize, clean and wrangle Water Quality Portal data in Cape Cod into more analytic-ready formats using the harmonize_wq package

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:

  • Identify differences in data units (including speciation and basis)

  • Identify differences in sampling or analytic methods

  • Resolve data errors using transparent assumptions

  • Reduce data to the columns that are most commonly needed

  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

Detailed step-by-step workflow

This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Cape Cod

Install and import the required libraries

[1]:
import sys
#! python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git
[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean

Download location data using dataretrieval

[3]:
# Read geometry for Area of Interest from geojson file url and plot
aoi_url = 'https://github.com/jbousquin/test_notebook/raw/main/temperature_data/NewEngland.geojson'
aoi_gdf = wrangle.as_gdf(aoi_url)  # Already 4326 standard
aoi_gdf.plot()
[3]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_9_1.png
[4]:
# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
                                'Temperature, water',
                                'Depth, Secchi disk depth',
                                'Dissolved oxygen (DO)',
                                'Salinity',
                                'pH',
                                'Nitrogen',
                                'Conductivity',
                                'Organic carbon',
                                'Chlorophyll a',
                                'Turbidity',
                                'Sediment',
                                'Fecal Coliform',
                                'Escherichia coli']}
query['bBox'] = wrangle.get_bounding_box(aoi_gdf)
[5]:
# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)
[6]:
# Rows and columns for results
stations.shape
[6]:
(11369, 37)
[7]:
# First 5 rows
stations.head()
[7]:
OrganizationIdentifier OrganizationFormalName MonitoringLocationIdentifier MonitoringLocationName MonitoringLocationTypeName MonitoringLocationDescriptionText HUCEightDigitCode DrainageAreaMeasure/MeasureValue DrainageAreaMeasure/MeasureUnitCode ContributingDrainageAreaMeasure/MeasureValue ... AquiferName LocalAqfrName FormationTypeText AquiferTypeName ConstructionDateText WellDepthMeasure/MeasureValue WellDepthMeasure/MeasureUnitCode WellHoleDepthMeasure/MeasureValue WellHoleDepthMeasure/MeasureUnitCode ProviderName
0 USGS-MA USGS Massachusetts Water Science Center USGS-010965305 MERRIMACK R NR TYNGSBOROUGH BRIDGE TYNGSBOROUG... Stream NaN 1070006.0 4070.00 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
1 USGS-MA USGS Massachusetts Water Science Center USGS-01096544 STONY BROOK AT SCHOOL STREET AT CHELMSFORD, MA Stream NaN 1070006.0 41.57 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
2 USGS-MA USGS Massachusetts Water Science Center USGS-01096546 STONY BROOK AT CHELMSFORD, MA Stream NaN 1070006.0 43.60 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
3 USGS-MA USGS Massachusetts Water Science Center USGS-01096548 STONY BROOK AT N CHELMSFORD, MA Stream NaN 1070006.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
4 USGS-MA USGS Massachusetts Water Science Center USGS-01096550 MERRIMACK RIVER ABOVE LOWELL, MA Stream NaN 1070006.0 3900.00 sq mi 3900.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS

5 rows × 37 columns

[8]:
# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]
[8]:
HorizontalCoordinateReferenceSystemDatumName        NAD83
LatitudeMeasure                                 42.677389
LongitudeMeasure                               -71.421056
Name: 0, dtype: object
[9]:
# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, outEPSG=4326, intermediate_columns=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:356: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  cond_notna = mask & (df_out["QA_flag"].notna())  # Mask cond and not NA
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'LatitudeMeasure: Imprecise: lessthan3decimaldigits' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:356: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  cond_notna = mask & (df_out["QA_flag"].notna())  # Mask cond and not NA
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
[10]:
# Every function has a dostring to help understand input/output and what it does
location.harmonize_locations?
[11]:
# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape
[11]:
(11369, 42)
[12]:
# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]
[12]:
geom_orig         (-71.4210556, 42.67738889)
EPSG                                  4269.0
QA_flag                                  NaN
geom         POINT (-71.4210556 42.67738889)
geometry     POINT (-71.4210556 42.67738889)
Name: 0, dtype: object
[13]:
# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype
[13]:
<geopandas.array.GeometryDtype at 0x7ffa7f5e5760>
[14]:
# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])
[14]:
{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
 'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits'}
[15]:
# Map it
stations_gdf.plot()
[15]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_21_1.png
[16]:
# Clip to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)
[17]:
# Map it
stations_clipped.plot()
[17]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_23_1.png
[18]:
# How many stations now?
len(stations_clipped)
[18]:
1929
[19]:
# To save the results to a shapefile
#import os
#path = ''  #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path,'CapeCod_stations.shp'))

Retrieve Characteristic Data

[20]:
# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)
[21]:
df = res_narrow
df
[21]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... ResultDetectionQuantitationLimitUrl LaboratoryAccreditationIndicator LaboratoryAccreditationAuthorityName TaxonomistAccreditationIndicator TaxonomistAccreditationAuthorityName LabSamplePreparationUrl ProviderName ActivityStartDateTime AnalysisStartDateTime AnalysisEndDateTime
0 USGS-MA USGS Massachusetts Water Science Center nwisma.01.90100001 1901-08-09 09:00:00 EST USGS-01116500 NWIS-66459391 NaN NaN ... NaN NaN NaN NaN NaN NaN NWIS 1901-08-09 14:00:00+00:00 NaT NaT
1 USGS-MA USGS Massachusetts Water Science Center nwisma.01.90100001 1901-08-09 09:00:00 EST USGS-01116500 NWIS-66459396 NaN NaN ... NaN NaN NaN NaN NaN NaN NWIS 1901-08-09 14:00:00+00:00 NaT NaT
2 USGS-MA USGS Massachusetts Water Science Center nwisma.01.90100001 1901-08-09 09:00:00 EST USGS-01116500 NWIS-66459398 NaN NaN ... NaN NaN NaN NaN NaN NaN NWIS 1901-08-09 14:00:00+00:00 NaT NaT
3 USGS-MA USGS Massachusetts Water Science Center nwisma.01.95300021 1952-10-21 09:00:00 EST USGS-01104500 NWIS-66461873 NaN NaN ... NaN NaN NaN NaN NaN NaN NWIS 1952-10-21 14:00:00+00:00 NaT NaT
4 USGS-MA USGS Massachusetts Water Science Center nwisma.01.95300023 1952-10-24 11:45:00 EST USGS-01108000 NWIS-66461945 NaN NaN ... NaN NaN NaN NaN NaN NaN NWIS 1952-10-24 16:45:00+00:00 NaT NaT
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
577222 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19560816_731708 1956-08-16 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598866 NaN NaN ... https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET NaT NaT NaT
577223 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19560616_731704 1956-06-16 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598862 NaN NaN ... https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET NaT NaT NaT
577224 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19560701_731705 1956-07-01 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598863 NaN NaN ... https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET NaT NaT NaT
577225 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_GREAT_W_19550816_731703 1955-08-16 NaN NaN 11NPSWRD_WQX-CACO_GREAT_W STORET-740649462 NaN NaN ... https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET NaT NaT NaT
577226 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_SLOUGH_19520816_731702 1952-08-16 NaN NaN 11NPSWRD_WQX-CACO_SLOUGH STORET-740745800 NaN Not Detected ... https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET NaT NaT NaT

577227 rows × 81 columns

[22]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[22]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_29_1.png

Harmonize Characteristic Results

Two options for functions to harmonize characteristics: harmonize_all() or harmonize_generic(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize_generic runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.

[23]:
# See Documentation
#harmonize_WQP.harmonize_all?
#harmonize_WQP.harmonize?

secchi disk depth

[24]:
# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)

# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, "Depth, Secchi disk depth", errors="ignore", intermediate_columns=True, report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'ResultMeasureValue: "BV" result cannot be used' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: '%' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(1.9812, 'meter')> <Quantity(3.9624, 'meter')>
 <Quantity(3.5052, 'meter')> ... <Quantity(17.0, 'meter')>
 <Quantity(16.0, 'meter')> <Quantity(7.8, 'meter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    8961.000000
mean       -5.920774
std       298.744632
min     -9999.000000
25%         1.400000
50%         2.000000
75%         3.962400
max        27.000000
dtype: float64
Unusable results: 2231
Usable results with inferred units: 8
Results outside threshold (0.0 to 1786.5470213513204): 16
../_images/notebooks_Harmonize_CapeCod_Detailed_34_2.png

The threshold is based on standard deviations and is currently only used in the histogram.

[25]:
# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results
[25]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
11317 11NPSWRD_WQX-SAMA_MEQ76_SB03 6.5 ft NaN ft 1.9811999999999999 meter
11334 11NPSWRD_WQX-SAMA_MEQ76_SB06 13.0 ft NaN ft 3.9623999999999997 meter
11376 11NPSWRD_WQX-SAMA_MEQ76_SB01 11.5 ft NaN ft 3.5051999999999994 meter
11401 11NPSWRD_WQX-SAMA_MEQ76_SB03 20.0 ft NaN ft 6.095999999999999 meter
11415 11NPSWRD_WQX-SAMA_MEQ76_SB05 11.0 ft NaN ft 3.3527999999999993 meter
... ... ... ... ... ... ...
577222 11NPSWRD_WQX-CACO_DUCK_W 10 m NaN m 10.0 meter
577223 11NPSWRD_WQX-CACO_DUCK_W 17 m NaN m 17.0 meter
577224 11NPSWRD_WQX-CACO_DUCK_W 16 m NaN m 16.0 meter
577225 11NPSWRD_WQX-CACO_GREAT_W 7.8 m NaN m 7.8 meter
577226 11NPSWRD_WQX-CACO_SLOUGH NaN m ResultMeasureValue: missing (NaN) result m NaN

11192 rows × 6 columns

[26]:
# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]
[26]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
27448 11113300-02-IPB BV m ResultMeasureValue: "BV" result cannot be used m NaN
32889 11113300-07A-MER BV m ResultMeasureValue: "BV" result cannot be used m NaN
99878 11113300-GRTKINSD =3.4 m ResultMeasureValue: "=3.4" result cannot be used m NaN
100885 11113300-CUBSDND =1.7 m ResultMeasureValue: "=1.7" result cannot be used m NaN
100978 11113300-CUBLDVLD =1.4 m ResultMeasureValue: "=1.4" result cannot be used m NaN
... ... ... ... ... ... ...
563277 11NPSWRD_WQX-CACO_GREAT_W NaN m ResultMeasureValue: missing (NaN) result m NaN
563944 11NPSWRD_WQX-CACO_GREAT_W NaN m ResultMeasureValue: missing (NaN) result m NaN
565619 11113300-DARLITD NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
569087 11113300-WORSALD NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
577226 11NPSWRD_WQX-CACO_SLOUGH NaN m ResultMeasureValue: missing (NaN) result m NaN

2231 rows × 6 columns

[27]:
# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]
[27]:
'ResultMeasureValue: "BV" result cannot be used'
[28]:
# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]
[28]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
27448 11113300-02-IPB BV m ResultMeasureValue: "BV" result cannot be used m NaN
32889 11113300-07A-MER BV m ResultMeasureValue: "BV" result cannot be used m NaN
99878 11113300-GRTKINSD =3.4 m ResultMeasureValue: "=3.4" result cannot be used m NaN
100885 11113300-CUBSDND =1.7 m ResultMeasureValue: "=1.7" result cannot be used m NaN
100978 11113300-CUBLDVLD =1.4 m ResultMeasureValue: "=1.4" result cannot be used m NaN
... ... ... ... ... ... ...
563277 11NPSWRD_WQX-CACO_GREAT_W NaN m ResultMeasureValue: missing (NaN) result m NaN
563944 11NPSWRD_WQX-CACO_GREAT_W NaN m ResultMeasureValue: missing (NaN) result m NaN
565619 11113300-DARLITD NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
569087 11113300-WORSALD NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
577226 11NPSWRD_WQX-CACO_SLOUGH NaN m ResultMeasureValue: missing (NaN) result m NaN

1273 rows × 6 columns

If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added

[29]:
# Aggregate secchi data by station
visualize.station_summary(sechi_results, 'Secchi')
[29]:
MonitoringLocationIdentifier cnt mean
0 11113300-ANGSDND 45 3.765500
1 11113300-ARLSALD 3 3.400000
2 11113300-BARKIND 1 2.100000
3 11113300-BAYKIND 1 1.900000
4 11113300-BEADERD 46 3.666304
... ... ... ...
983 WWMD_VA-SH1 2 1.500000
984 WWMD_VA-SH2 2 1.800000
985 WWMD_VA-SR6A 1 0.500000
986 WWMD_VA-WF2 1 0.600000
987 WWMD_VA-WR5 1 2.000000

988 rows × 3 columns

[30]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
  self.bins = quantile(y, k=k)
[30]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_42_2.png
[31]:
# Map average results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[31]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_43_1.png

Temperature

The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])

[32]:
#'Temperature, water'
# errors=‘ignore’, invalid dimension conversions will return the NaN.
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(2.0, 'degree_Celsius')> <Quantity(1.0, 'degree_Celsius')>
 <Quantity(16.5, 'degree_Celsius')> ... <Quantity(4.0, 'degree_Celsius')>
 <Quantity(1.5, 'degree_Celsius')> <Quantity(3.9, 'degree_Celsius')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    151125.000000
mean         16.311906
std           7.273072
min          -9.830000
25%          11.260000
50%          17.300000
75%          21.910000
max         910.000000
dtype: float64
Unusable results: 226
Usable results with inferred units: 217
Results outside threshold (0.0 to 59.95033855767852): 477
../_images/notebooks_Harmonize_CapeCod_Detailed_46_2.png
[33]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results
[33]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
0 USGS-01116500 2.0 deg C NaN 2.0 degree_Celsius degC
5 USGS-01104200 1.0 deg C NaN 1.0 degree_Celsius degC
13 USGS-01105500 16.5 deg C NaN 16.5 degree_Celsius degC
14 USGS-01100000 17.0 deg C NaN 17.0 degree_Celsius degC
15 USGS-01108500 17.0 deg C NaN 17.0 degree_Celsius degC
... ... ... ... ... ... ...
577207 11NPSWRD_WQX-CACO_GULL 4.8 deg C NaN 4.8 degree_Celsius degC
577208 11NPSWRD_WQX-CACO_GULL 9.1 deg C NaN 9.1 degree_Celsius degC
577209 11NPSWRD_WQX-CACO_GULL 4 deg C NaN 4.0 degree_Celsius degC
577210 11NPSWRD_WQX-CACO_GULL 1.5 deg C NaN 1.5 degree_Celsius degC
577211 11NPSWRD_WQX-CACO_GULL 3.9 deg C NaN 3.9 degree_Celsius degC

151351 rows × 6 columns

[34]:
# Examine deg F
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'] == 'deg F']
[34]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
11300 11NPSWRD_WQX-SAMA_MEQ76_SB01 59 deg F NaN 15.000000000000057 degree_Celsius degF
11302 11NPSWRD_WQX-SAMA_MEQ76_SB02 60 deg F NaN 15.5555555555556 degree_Celsius degF
11307 11NPSWRD_WQX-SAMA_MEQ76_SB02 58 deg F NaN 14.444444444444457 degree_Celsius degF
11310 11NPSWRD_WQX-SAMA_MEQ76_SB03 62 deg F NaN 16.666666666666686 degree_Celsius degF
11316 11NPSWRD_WQX-SAMA_MEQ76_SB03 60 deg F NaN 15.5555555555556 degree_Celsius degF
... ... ... ... ... ... ...
576586 11NPSWRD_WQX-SAIR_DMF11_HS1 56 deg F NaN 13.333333333333371 degree_Celsius degF
576587 11NPSWRD_WQX-SAIR_DMF11_OS3 62 deg F NaN 16.666666666666686 degree_Celsius degF
576588 11NPSWRD_WQX-SAIR_DMF11_OS4 61 deg F NaN 16.111111111111143 degree_Celsius degF
576594 11NPSWRD_WQX-SAIR_DMF11_S2 43 deg F NaN 6.111111111111143 degree_Celsius degF
576595 11NPSWRD_WQX-SAIR_DMF11_OS1 59 deg F NaN 15.000000000000057 degree_Celsius degF

695 rows × 6 columns

In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius

[35]:
# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]
[35]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
28268 USGS-414654070002901 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
383203 11113300-HHPS056 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
383209 11113300-HHPS055 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
383213 11113300-HHPS061 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
461985 AQUINNAH-MEN PND HC 0 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 0.0 degree_Celsius degC
... ... ... ... ... ... ...
462721 AQUINNAH-SQUIB # 33 0 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 0.0 degree_Celsius degC
462722 AQUINNAH-SQUIB # 33 0 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 0.0 degree_Celsius degC
462723 AQUINNAH-SQUIB # 33 0 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 0.0 degree_Celsius degC
462734 AQUINNAH-SQUIB # 33 0 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 0.0 degree_Celsius degC
470083 AQUINNAH-MEN PND HC 0 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 0.0 degree_Celsius degC

221 rows × 6 columns

We can see where the units were missing, the results were assumed to be in degree_Celsius already

[36]:
# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]
[36]:
'ResultMeasureValue: missing (NaN) result; ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'
[37]:
# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]
[37]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
28268 USGS-414654070002901 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
80417 11NPSWRD_WQX-SAIR_SWC_SR3 NaN deg C ResultMeasureValue: missing (NaN) result NaN degC
104811 11113300-HHPS073 NO DATA deg C ResultMeasureValue: "NO DATA" result cannot be... NaN degC
108592 11113300-HHPS058 NO DATA deg C ResultMeasureValue: "NO DATA" result cannot be... NaN degC
108641 11113300-HHPS071 NO DATA deg C ResultMeasureValue: "NO DATA" result cannot be... NaN degC
... ... ... ... ... ... ...
435657 11113300-HHPS072 NO DATA deg C ResultMeasureValue: "NO DATA" result cannot be... NaN degC
435687 11113300-HHPS073 NO DATA deg C ResultMeasureValue: "NO DATA" result cannot be... NaN degC
523147 11NPSWRD_WQX-SAIR_SWC_SR6 NaN deg C ResultMeasureValue: missing (NaN) result NaN degC
523872 11NPSWRD_WQX-SAIR_SWC_SR7 NaN deg C ResultMeasureValue: missing (NaN) result NaN degC
526312 11NPSWRD_WQX-SAIR_SWC_SRT9 NaN deg C ResultMeasureValue: missing (NaN) result NaN degC

226 rows × 6 columns

[38]:
# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')
[38]:
MonitoringLocationIdentifier cnt mean
0 11113300-00-SPB 21 10.909524
1 11113300-00F-KLY 3 20.233333
2 11113300-01-BAK 10 20.600000
3 11113300-01-BVR 1 21.000000
4 11113300-01-CTP 39 24.438462
... ... ... ...
6563 WWMD_VA-WWE1149PI 1 18.800000
6564 WWMD_VA-WWE1150PI 1 21.300000
6565 WWMD_VA-WWE1151PI 2 20.700000
6566 WWMD_VA-WWE1152PI 1 20.000000
6567 WWMD_VA-WWE1153PI 1 20.800000

6568 rows × 3 columns

[39]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[39]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_55_1.png
[40]:
# Map average results at each station
gdf_avg = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[40]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_56_1.png

Dissolved Oxygen (DO)

[41]:
# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(8.0, 'milligram / liter')> <Quantity(8.0, 'milligram / liter')>
 <Quantity(10.0, 'milligram / liter')> ...
 <Quantity(10.0, 'milligram / liter')>
 <Quantity(10.0, 'milligram / liter')>
 <Quantity(9.0, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)

Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.

[42]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res
[42]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag DO
2617 11NPSWRD_WQX-SAMA_EPG_SH13 8.0 mg/L NaN 8.0 milligram / liter
2624 11NPSWRD_WQX-SAMA_EPG_SH18 8.0 mg/L NaN 8.0 milligram / liter
2627 11NPSWRD_WQX-SAMA_EPG_SH3 10.0 mg/L NaN 10.0 milligram / liter
2629 11NPSWRD_WQX-SAMA_EPG_SH1 9.0 mg/L NaN 9.0 milligram / liter
2635 11NPSWRD_WQX-SAMA_EPG_SH7 4.0 mg/L NaN 4.0 milligram / liter
... ... ... ... ... ...
576585 11NPSWRD_WQX-SAIR_DMF11_S1 8 mg/L NaN 8.0 milligram / liter
576589 11NPSWRD_WQX-SAIR_DMF11_S3 10 mg/L NaN 10.0 milligram / liter
576590 11NPSWRD_WQX-SAIR_DMF11_S4 10 mg/L NaN 10.0 milligram / liter
576592 11NPSWRD_WQX-SAIR_DMF11_S5 10 mg/L NaN 10.0 milligram / liter
576597 11NPSWRD_WQX-SAIR_DMF11_S3 9 mg/L NaN 9.0 milligram / liter

95156 rows × 5 columns

[43]:
do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']
[43]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag DO
2617 11NPSWRD_WQX-SAMA_EPG_SH13 8.0 mg/L NaN 8.0 milligram / liter
2624 11NPSWRD_WQX-SAMA_EPG_SH18 8.0 mg/L NaN 8.0 milligram / liter
2627 11NPSWRD_WQX-SAMA_EPG_SH3 10.0 mg/L NaN 10.0 milligram / liter
2629 11NPSWRD_WQX-SAMA_EPG_SH1 9.0 mg/L NaN 9.0 milligram / liter
2635 11NPSWRD_WQX-SAMA_EPG_SH7 4.0 mg/L NaN 4.0 milligram / liter
... ... ... ... ... ...
576585 11NPSWRD_WQX-SAIR_DMF11_S1 8 mg/L NaN 8.0 milligram / liter
576589 11NPSWRD_WQX-SAIR_DMF11_S3 10 mg/L NaN 10.0 milligram / liter
576590 11NPSWRD_WQX-SAIR_DMF11_S4 10 mg/L NaN 10.0 milligram / liter
576592 11NPSWRD_WQX-SAIR_DMF11_S5 10 mg/L NaN 10.0 milligram / liter
576597 11NPSWRD_WQX-SAIR_DMF11_S3 9 mg/L NaN 9.0 milligram / liter

93859 rows × 5 columns

Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())

[44]:
# Aggregate data by station
visualize.station_summary(do_res, 'DO')
[44]:
MonitoringLocationIdentifier cnt mean
0 11113300-00F-KLY 3 4.040000
1 11113300-01-BAK 10 5.697000
2 11113300-01-BVR 1 6.700000
3 11113300-01-CTP 37 6.945135
4 11113300-01-GOL 1 4.400000
... ... ... ...
2875 WWMD_VA-SR5 1 10.810000
2876 WWMD_VA-WF2 2 6.800000
2877 WWMD_VA-WI1 1 9.510000
2878 WWMD_VA-WR2X 2 6.000000
2879 WWMD_VA-WR5 1 8.355000

2880 rows × 3 columns

[45]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[45]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_64_1.png
[46]:
# Map average results at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[46]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_65_1.png

pH

[47]:
# pH, this time looking at a report
df = harmonize.harmonize(df, "pH", errors="ignore", report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'mV' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(6.7, 'dimensionless')> <Quantity(7.1, 'dimensionless')>
 <Quantity(6.3, 'dimensionless')> ... <Quantity(6.6, 'dimensionless')>
 <Quantity(6.8, 'dimensionless')> <Quantity(7.1, 'dimensionless')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    107912.000000
mean          6.683628
std           4.078693
min          -3.124705
25%           5.870000
50%           6.800000
75%           7.670000
max         788.000000
dtype: float64
Unusable results: 4814
Usable results with inferred units: 86358
Results outside threshold (0.0 to 31.15578654092979): 14
../_images/notebooks_Harmonize_CapeCod_Detailed_67_2.png

Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.

[48]:
df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]
[48]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag pH
1 6.7 std units NaN 6.7 dimensionless
3 7.1 std units NaN 7.1 dimensionless
4 6.3 std units NaN 6.3 dimensionless
6 5.7 std units NaN 5.7 dimensionless
7 6.9 std units NaN 6.9 dimensionless
... ... ... ... ...
577119 6.6 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 6.6 dimensionless
577126 6.9 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 6.9 dimensionless
577150 6.6 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 6.6 dimensionless
577165 6.8 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 6.8 dimensionless
577192 7.1 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 7.1 dimensionless

112726 rows × 4 columns

‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless

Salinity

[49]:
# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/basis.py:154: UserWarning: Mismatched ResultTemperatureBasisText: updated from 25 deg C to @25C (units)
  warn(f"Mismatched {flag}", UserWarning)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'deg C' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(31.0, 'Practical_Salinity_Units')>
 <Quantity(27.0, 'Practical_Salinity_Units')>
 <Quantity(33.0, 'Practical_Salinity_Units')> ...
 <Quantity(31.0, 'Practical_Salinity_Units')>
 <Quantity(32.0, 'Practical_Salinity_Units')>
 <Quantity(27.0, 'Practical_Salinity_Units')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    61120.000000
mean        23.209709
std         20.530382
min        -30.000000
25%         13.160000
50%         30.000000
75%         31.500000
max       4003.482834
dtype: float64
Unusable results: 228
Usable results with inferred units: 1
Results outside threshold (0.0 to 146.39200087840806): 3
../_images/notebooks_Harmonize_CapeCod_Detailed_72_2.png
[50]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]
[50]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity
2620 31.0 ppth NaN 31.0 Practical_Salinity_Units
2623 27.0 ppth NaN 27.0 Practical_Salinity_Units
2625 33.0 ppth NaN 33.0 Practical_Salinity_Units
2631 30.0 ppth NaN 30.0 Practical_Salinity_Units
2633 32.5 ppth NaN 32.5 Practical_Salinity_Units
... ... ... ... ...
576576 32.5 ppth NaN 32.5 Practical_Salinity_Units
576584 31.5 ppth NaN 31.5 Practical_Salinity_Units
576591 31.0 ppth NaN 31.0 Practical_Salinity_Units
576593 32.0 ppth NaN 32.0 Practical_Salinity_Units
576596 27.0 ppth NaN 27.0 Practical_Salinity_Units

61348 rows × 4 columns

Nitrogen

[51]:
# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/basis.py:343: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'as N' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask, basis_col] = basis
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:484: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '['as N' 'as N' 'as N' ... nan nan nan]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  self.df[c_mask] = basis.basis_from_method_spec(self.df[c_mask])
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'cm3/g' UNDEFINED UNIT for Nitrogen
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/pandas/core/construction.py:627: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.asarray(data)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/pandas/core/construction.py:627: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.asarray(data)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(1900.0, 'milligram / liter')>
 <Quantity(2800.0, 'milligram / liter')>
 <Quantity(1100.0, 'milligram / liter')> ...
 <Quantity(1.7892171, 'milligram / liter')>
 <Quantity(0.2243001, 'milligram / liter')>
 <Quantity(0.1150221, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
-Usable results-
count    2508.000000
mean        6.264310
std       104.802798
min         0.000800
25%         0.078000
50%         0.264375
75%         1.030000
max      2800.000000
dtype: float64
Unusable results: 243
Usable results with inferred units: 0
Results outside threshold (0.0 to 635.0810968269222): 6
../_images/notebooks_Harmonize_CapeCod_Detailed_75_2.png
[52]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]
[52]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Nitrogen
17836 .19 % NaN 1900.0 milligram / liter
17837 .28 % NaN 2800.0000000000005 milligram / liter
17838 NaN % ResultMeasureValue: missing (NaN) result NaN
17848 .11 % NaN 1100.0 milligram / liter
17854 .28 % NaN 2800.0000000000005 milligram / liter
... ... ... ... ...
526271 109.59 umol NaN 1.5353559 milligram / liter
526424 25.06 umol NaN 0.3510906 milligram / liter
526436 127.71 umol NaN 1.7892171 milligram / liter
526483 16.01 umol NaN 0.2243001 milligram / liter
526484 8.21 umol NaN 0.1150221 milligram / liter

2751 rows × 4 columns

Conductivity

[53]:
# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(327.0, 'microsiemens / centimeter')>
 <Quantity(127.0, 'microsiemens / centimeter')>
 <Quantity(345.0, 'microsiemens / centimeter')> ...
 <Quantity(433.0, 'microsiemens / centimeter')>
 <Quantity(444.0, 'microsiemens / centimeter')>
 <Quantity(424.0, 'microsiemens / centimeter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count     13563.000000
mean      27191.912855
std       18509.783617
min           0.000000
25%       16480.300000
50%       30669.300000
75%       41824.300000
max      992000.000000
dtype: float64
Unusable results: 82
Usable results with inferred units: 0
Results outside threshold (0.0 to 138250.6145569042): 3
../_images/notebooks_Harmonize_CapeCod_Detailed_78_2.png
[54]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]
[54]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Conductivity
96800 327 uS/cm NaN 327.0 microsiemens / centimeter
96803 127 uS/cm NaN 127.0 microsiemens / centimeter
96972 345 uS/cm NaN 345.0 microsiemens / centimeter
96974 348 uS/cm NaN 348.0 microsiemens / centimeter
96977 344 uS/cm NaN 344.0 microsiemens / centimeter
... ... ... ... ...
482620 610 uS/cm NaN 610.0 microsiemens / centimeter
482724 150 uS/cm NaN 150.0 microsiemens / centimeter
483348 433 uS/cm NaN 433.0 microsiemens / centimeter
484074 444 uS/cm NaN 444.0 microsiemens / centimeter
484703 424 uS/cm NaN 424.0 microsiemens / centimeter

13645 rows × 4 columns

Chlorophyll a

[55]:
# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ppb' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/m3' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/cm2' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(0.0027, 'milligram / liter')>
 <Quantity(0.0017, 'milligram / liter')>
 <Quantity(0.000233, 'milligram / liter')> ...
 <Quantity(1.46, 'milligram / liter')>
 <Quantity(0.04, 'milligram / liter')>
 <Quantity(4.22, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    14059.00000
mean         0.17631
std          1.80612
min         -0.00240
25%          0.00180
50%          0.00350
75%          0.00730
max         92.90000
dtype: float64
Unusable results: 204
Usable results with inferred units: 9
Results outside threshold (0.0 to 11.01303017093192): 64
../_images/notebooks_Harmonize_CapeCod_Detailed_81_2.png
[56]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]
[56]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Chlorophyll
12043 2.70 mg/m2 NaN 0.002700000000000001 milligram / liter
12044 1.70 mg/m2 NaN 0.0017000000000000003 milligram / liter
12057 0.233 mg/m2 NaN 0.00023300000000000008 milligram / liter
12064 0.200 mg/m2 NaN 0.00020000000000000006 milligram / liter
12084 0.912 mg/m2 NaN 0.0009120000000000003 milligram / liter
... ... ... ... ...
556627 7.28 ug/m3 ResultMeasure/MeasureUnitCode: 'ug/m3' UNDEFIN... 7.28 milligram / liter
556628 12.51 ug/m3 ResultMeasure/MeasureUnitCode: 'ug/m3' UNDEFIN... 12.51 milligram / liter
556629 1.46 ug/m3 ResultMeasure/MeasureUnitCode: 'ug/m3' UNDEFIN... 1.46 milligram / liter
556630 0.04 ug/m3 ResultMeasure/MeasureUnitCode: 'ug/m3' UNDEFIN... 0.04 milligram / liter
556631 4.22 ug/m3 ResultMeasure/MeasureUnitCode: 'ug/m3' UNDEFIN... 4.22 milligram / liter

14263 rows × 4 columns

Organic Carbon

[57]:
# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'umol/L * H2O' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(5.5, 'milligram / liter')> <Quantity(8.5, 'milligram / liter')>
 <Quantity(7.0, 'milligram / liter')> ...
 <Quantity(9.3, 'milligram / liter')> <Quantity(0.2, 'milligram / liter')>
 <Quantity(0.4, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count      6196.000000
mean       5240.393408
std       35437.445450
min      -90000.000000
25%           1.800000
50%           4.900000
75%           8.872500
max      530000.000000
dtype: float64
Unusable results: 89
Usable results with inferred units: 0
Results outside threshold (0.0 to 217865.06610661917): 40
../_images/notebooks_Harmonize_CapeCod_Detailed_84_2.png
[58]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]
[58]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Carbon
6538 5.5 mg/l NaN 5.5 milligram / liter
7245 8.5 mg/l NaN 8.5 milligram / liter
8296 7.0 mg/l NaN 7.0 milligram / liter
8591 4.3 mg/l NaN 4.3 milligram / liter
8754 4.9 mg/l NaN 4.9 milligram / liter
... ... ... ... ...
563970 63900 ug/g NaN 63900.000000000015 milligram / liter
564619 15200 ug/g NaN 15200.000000000004 milligram / liter
570862 9.3 mg/l NaN 9.3 milligram / liter
575466 0.2 mg/l NaN 0.2 milligram / liter
575467 0.4 mg/l NaN 0.4 milligram / liter

6285 rows × 4 columns

Turbidity (NTU)

[59]:
# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/harmonize.py:149: UserWarning: Bad Turbidity unit: count
  warn(f"Bad Turbidity unit: {unit}")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(7.5701, 'Nephelometric_Turbidity_Units')>
 <Quantity(18.9773, 'Nephelometric_Turbidity_Units')>
 <Quantity(570.7023, 'Nephelometric_Turbidity_Units')> ...
 <Quantity(1.9, 'Nephelometric_Turbidity_Units')>
 <Quantity(1.7, 'Nephelometric_Turbidity_Units')>
 <Quantity(2.0, 'Nephelometric_Turbidity_Units')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    34827.000000
mean        15.519654
std         71.703547
min       -999.000000
25%          1.250000
50%          2.500000
75%          5.700000
max       4100.000000
dtype: float64
Unusable results: 3418
Usable results with inferred units: 275
Results outside threshold (0.0 to 445.740934981733): 215
../_images/notebooks_Harmonize_CapeCod_Detailed_87_2.png
[60]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]
[60]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Turbidity
290 1.0 mg/l SiO2 NaN 7.5701 Nephelometric_Turbidity_Units
523 1 JTU NaN 18.9773 Nephelometric_Turbidity_Units
531 30 JTU NaN 570.7023 Nephelometric_Turbidity_Units
535 2 JTU NaN 38.0023 Nephelometric_Turbidity_Units
540 7 JTU NaN 133.1273 Nephelometric_Turbidity_Units
... ... ... ... ...
567067 4.0 mg/l SiO2 NaN 30.378500000000003 Nephelometric_Turbidity_Units
573181 2.3 NTU NaN 2.3 Nephelometric_Turbidity_Units
575929 1.9 NTU NaN 1.9 Nephelometric_Turbidity_Units
575977 1.7 NTU NaN 1.7 Nephelometric_Turbidity_Units
575982 2 NTU NaN 2.0 Nephelometric_Turbidity_Units

38245 rows × 4 columns

Sediment

[61]:
# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'g / H2O' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:134: FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
  return pandas.concat(lst_series).sort_index()
[62]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]
[62]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Sediment
246472 0.0024 g NaN NaN
246484 0.0034 g NaN NaN
246497 0.0016 g NaN NaN
246506 0.0046 g NaN NaN
246513 0.0016 g NaN NaN
... ... ... ... ...
365519 0.0049 g NaN NaN
368318 0.0070 g NaN NaN
368593 0.0093 g NaN NaN
371133 0.0020 g NaN NaN
371227 0.0049 g NaN NaN

4410 rows × 4 columns

Phosphorus

Note: must be merged w/ activities (package runs query by site if not already merged)

[63]:
# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus', errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/pandas/core/construction.py:627: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.asarray(data)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/pandas/core/construction.py:627: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.asarray(data)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/pandas/core/construction.py:627: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.asarray(data)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'umol/L * H2O' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(0.56, 'milligram / liter')>
 <Quantity(0.89, 'milligram / liter')>
 <Quantity(1.7, 'milligram / liter')> ...
 <Quantity(0.01, 'milligram / liter')>
 <Quantity(0.011, 'milligram / liter')>
 <Quantity(0.007, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"

Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.

[64]:
# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]
[64]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
1194 0.56 mg/l PO4 NaN NaN
1290 0.89 mg/l PO4 NaN NaN
1310 1.7 mg/l PO4 NaN NaN
1355 0.28 mg/l PO4 NaN NaN
1400 0.07 mg/l PO4 NaN NaN
... ... ... ... ...
575998 0.024 mg/L NaN NaN
576642 0.01 mg/L NaN NaN
576644 0.01 mg/L NaN NaN
576738 0.011 mg/L NaN NaN
576768 0.007 mg/L NaN NaN

19507 rows × 4 columns

[65]:
# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]
[65]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
1194 0.56 mg/l PO4 NaN NaN
1290 0.89 mg/l PO4 NaN NaN
1310 1.7 mg/l PO4 NaN NaN
1355 0.28 mg/l PO4 NaN NaN
1400 0.07 mg/l PO4 NaN NaN
... ... ... ... ...
575998 0.024 mg/L NaN NaN
576642 0.01 mg/L NaN NaN
576644 0.01 mg/L NaN NaN
576738 0.011 mg/L NaN NaN
576768 0.007 mg/L NaN NaN

14234 rows × 4 columns

[66]:
# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]
[66]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
1483 0.080 mg/l as P NaN 0.08 milligram / liter
1488 0.040 mg/l as P NaN 0.04 milligram / liter
1493 0.290 mg/l as P NaN 0.29 milligram / liter
1499 0.290 mg/l as P NaN 0.29 milligram / liter
1503 0.110 mg/l as P NaN 0.11 milligram / liter
... ... ... ... ...
493033 0.03 mg/L NaN 0.03 milligram / liter
493221 0.05 mg/L NaN 0.05 milligram / liter
493229 0.02 mg/L NaN 0.02 milligram / liter
493296 0.03 mg/L NaN 0.03 milligram / liter
575465 0.07 mg/l as P NaN 0.07 milligram / liter

4857 rows × 4 columns

[67]:
# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]
[67]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
27579 140 mg/kg as P NaN NaN
27580 570 mg/kg as P NaN NaN
27583 100 mg/kg as P NaN NaN
27584 630 mg/kg as P NaN NaN
27586 190 mg/kg as P NaN NaN
... ... ... ... ...
571217 1300 mg/kg as P NaN NaN
571218 800 mg/kg as P NaN NaN
571219 2100 mg/kg as P NaN NaN
571220 1900 mg/kg as P NaN NaN
571221 1700 mg/kg as P NaN NaN

416 rows × 4 columns

Bacteria

Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.

Fecal Coliform

[68]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(1.0, 'Colony_Forming_Units / milliliter')>
 <Quantity(4200.0, 'Colony_Forming_Units / milliliter')>
 <Quantity(410.0, 'Colony_Forming_Units / milliliter')> ... nan nan nan]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count      2462.000000
mean        910.512591
std        6103.365912
min           0.000000
25%          24.000000
50%          93.000000
75%         400.000000
max      250000.000000
dtype: float64
Unusable results: 5815
Usable results with inferred units: 1
Results outside threshold (0.0 to 37530.70806108734): 7
../_images/notebooks_Harmonize_CapeCod_Detailed_103_2.png
[69]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]
[69]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Fecal_Coliform
2 1 cfu/100ml NaN 1.0 Colony_Forming_Units / milliliter
1463 4200 cfu/100ml NaN 4200.0 Colony_Forming_Units / milliliter
1538 410 cfu/100ml NaN 410.0 Colony_Forming_Units / milliliter
1545 10000 cfu/100ml NaN 10000.0 Colony_Forming_Units / milliliter
1547 10000 cfu/100ml NaN 10000.0 Colony_Forming_Units / milliliter
... ... ... ... ...
575994 400 #/100mL NaN NaN
575995 300 #/100mL NaN NaN
575996 24000 #/100mL NaN NaN
576797 180 #/100mL NaN NaN
577117 NaN #/100mL ResultMeasureValue: missing (NaN) result NaN

8277 rows × 4 columns

Excherichia Coli

[70]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'CFUcol/100mL' UNDEFINED UNIT for E_coli
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: '%' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[nan nan nan ... nan nan nan]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count     5190.000000
mean       245.858339
std       1752.146999
min          0.000000
25%          0.500000
50%         30.000000
75%        127.750000
max      72000.000000
dtype: float64
Unusable results: 27840
Usable results with inferred units: 0
Results outside threshold (0.0 to 10758.740332329791): 14
../_images/notebooks_Harmonize_CapeCod_Detailed_106_2.png
[71]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]
[71]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag E_coli
28967 930 MPN/100mL NaN NaN
29097 430 MPN/100mL NaN NaN
30225 430 MPN/100mL NaN NaN
32346 40 MPN/100mL NaN NaN
32355 90 MPN/100mL NaN NaN
... ... ... ... ...
548468 89 #/100mL NaN NaN
548473 59 #/100mL NaN NaN
548558 1 #/100mL NaN NaN
548697 17 #/100mL NaN NaN
548823 5 #/100mL NaN NaN

33030 rows × 4 columns

Combining Salinity and Conductivity

Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).

Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.

Explore Salinity results:

[72]:
from harmonize_wq import convert
[73]:
# First note initial Salinity info
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))
Range: -30.0 to 4003.4828342857154
Results: 61120
Mean: 23.20970929655375 PSU
[74]:
# Identify extreme outliers
[x for x in lst if x >3200]
[74]:
[4003.4828342857154]

Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration

[75]:
# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
[76]:
# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]
[76]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity
65460 4980 mg/mL @25C ResultTemperatureBasisText: updated from 25 de... 4003.4828342857154 Practical_Salinity_Units
163496 804 ppth NaN 804.0 Practical_Salinity_Units
301168 77.6666666666667 ppt NaN 77.6666666666667 Practical_Salinity_Units
470190 70.62 ppth NaN 70.62 Practical_Salinity_Units
470260 71.49 ppth NaN 71.49 Practical_Salinity_Units

Detection limits may help understand what caused low values and what results might need to be dropped or updated

[77]:
df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
       'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode']
[78]:
# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]
[78]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity ResultDetectionConditionText DetectionQuantitationLimitTypeName DetectionQuantitationLimitMeasure/MeasureValue DetectionQuantitationLimitMeasure/MeasureUnitCode
23795 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
23797 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
23803 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
23825 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
23829 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
542778 0 ppt NaN 0.0 Practical_Salinity_Units NaN Reporting limit 0.0 ppt
543228 0 ppt NaN 0.0 Practical_Salinity_Units NaN Reporting limit 0.0 ppt
544507 0 ppt NaN 0.0 Practical_Salinity_Units NaN Reporting limit 0.0 ppt
544730 0 ppt NaN 0.0 Practical_Salinity_Units NaN Reporting limit 0.0 ppt
545537 0 ppt NaN 0.0 Practical_Salinity_Units NaN Reporting limit 0.0 ppt

876 rows × 8 columns

Explore conductivity results:

[79]:
# Create series for Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series
[79]:
96800     327.0 microsiemens / centimeter
96803     127.0 microsiemens / centimeter
96972     345.0 microsiemens / centimeter
96974     348.0 microsiemens / centimeter
96977     344.0 microsiemens / centimeter
                       ...
482620    610.0 microsiemens / centimeter
482724    150.0 microsiemens / centimeter
483348    433.0 microsiemens / centimeter
484074    444.0 microsiemens / centimeter
484703    424.0 microsiemens / centimeter
Name: Conductivity, Length: 13563, dtype: object

Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters

[80]:
# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]
[80]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity Conductivity
386985 992 mS/cm NaN NaN 992000.0 microsiemens / centimeter
386996 270.2 mS/cm NaN NaN 270200.0 microsiemens / centimeter
387616 241.1 mS/cm NaN NaN 241100.0 microsiemens / centimeter
474268 57700 uS/cm NaN NaN 57700.0 microsiemens / centimeter
347883 52.418 mS/cm NaN NaN 52418.0 microsiemens / centimeter
... ... ... ... ... ...
182694 0 uS/cm NaN NaN 0.0 microsiemens / centimeter
382645 0 uS/cm NaN NaN 0.0 microsiemens / centimeter
380818 0 uS/cm NaN NaN 0.0 microsiemens / centimeter
380050 0 uS/cm NaN NaN 0.0 microsiemens / centimeter
182683 0 uS/cm NaN NaN 0.0 microsiemens / centimeter

13563 rows × 5 columns

[81]:
# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str)  # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']
[81]:
386985    3226.305 dimensionless
386996      269.94 dimensionless
387616      227.97 dimensionless
474268      38.468 dimensionless
347883      34.521 dimensionless
                   ...
182694       0.012 dimensionless
382645       0.012 dimensionless
380818       0.012 dimensionless
380050       0.012 dimensionless
182683       0.012 dimensionless
Name: Salinity, Length: 13563, dtype: object

Datetime

datetime() formats time using dataretrieval and ActivityStart

[82]:
# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]
[82]:
ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode
386985 2025-08-13 10:29:00 EDT
386996 2025-08-13 10:55:00 EDT
387616 2025-09-02 09:36:00 EDT
474268 2021-06-21 NaN NaN
347883 2022-08-09 12:58:00 EST
... ... ... ...
577222 1956-08-16 NaN NaN
577223 1956-06-16 NaN NaN
577224 1956-07-01 NaN NaN
577225 1955-08-16 NaN NaN
577226 1952-08-16 NaN NaN

577227 rows × 3 columns

[83]:
# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]
[83]:
ActivityStartDate Activity_datetime
386985 2025-08-13 2025-08-13 14:29:00+00:00
386996 2025-08-13 2025-08-13 14:55:00+00:00
387616 2025-09-02 2025-09-02 13:36:00+00:00
474268 2021-06-21 NaT
347883 2022-08-09 2022-08-09 17:58:00+00:00
... ... ...
577222 1956-08-16 NaT
577223 1956-06-16 NaT
577224 1956-07-01 NaT
577225 1955-08-16 NaT
577226 1952-08-16 NaT

577227 rows × 2 columns

Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.

Depth

Note: Data are often lacking sample depth metadata

[84]:
# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()
[84]:
125288    0.15
123998    0.15
115179    0.15
114053    0.15
124031    0.15
          ...
546084    0.15
546121    0.15
546124    0.15
546129    0.15
546131    0.15
Name: ResultDepthHeightMeasure/MeasureValue, Length: 24059, dtype: float64

Characteristic to Column (long to wide format)

[85]:
# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2
[85]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... QA_Fecal_Coliform QA_Sediment QA_TP_Phosphorus QA_TDP_Phosphorus QA_Other_Phosphorus QA_Secchi QA_DO QA_Temperature QA_Chlorophyll QA_Salinity
386985 SSCW Salem Sound Coastwatch (Volunteer) SSCW-552 (P3):20250813:1029:FM:WB:0.5 2025-08-13 10:29:00 EDT SSCW-552 (P3) STORET-1102951305 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
386996 SSCW Salem Sound Coastwatch (Volunteer) SSCW-P2:20250813:1055:FM:WB:0.5 2025-08-13 10:55:00 EDT SSCW-P2 STORET-1102951334 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
387616 SSCW Salem Sound Coastwatch (Volunteer) SSCW-P2:20250902:0936:FM:WB:0.5 2025-09-02 09:36:00 EDT SSCW-P2 STORET-1102951403 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
474268 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-185450_2021 2021-06-21 NaN NaN NARS_WQX-NWC_RI-10033 STORET-1040684142 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
347883 AQUINNAH_WQX Wampanoag Tribe of Gay Head (Aquinnah) (Tribal) AQUINNAH_WQX-P3:202208091258:FM 2022-08-09 12:58:00 EST AQUINNAH_WQX-P3 STORET-1057585495 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
577221 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19570701_731710 1957-07-01 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598867 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
577222 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19560816_731708 1956-08-16 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598866 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
577223 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19560616_731704 1956-06-16 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598862 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
577224 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_DUCK_W_19560701_731705 1956-07-01 NaN NaN 11NPSWRD_WQX-CACO_DUCK_W STORET-740598863 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
577225 11NPSWRD_WQX National Park Service Water Resources Division 11NPSWRD_WQX-CACO_GREAT_W_19550816_731703 1955-08-16 NaN NaN 11NPSWRD_WQX-CACO_GREAT_W STORET-740649462 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

522063 rows × 121 columns

[86]:
# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))
15 new columns
[87]:
# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))
55164 fewer rows
[88]:
#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]
[88]:
ResultMeasureValue ResultMeasure/MeasureUnitCode Carbon QA_Carbon

Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)

[89]:
# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)
[90]:
# Columns still in main table
main_df.columns
[90]:
Index(['OrganizationIdentifier', 'OrganizationFormalName',
       'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
       'ActivityStartDateTime', 'AnalysisStartDateTime', 'AnalysisEndDateTime',
       'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
       'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
       'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
       'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
       'E_coli', 'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode',
       'Activity_datetime', 'Depth', 'QA_Nitrogen', 'QA_pH', 'QA_Turbidity',
       'QA_Conductivity', 'QA_Carbon', 'QA_E_coli', 'QA_Fecal_Coliform',
       'QA_Sediment', 'QA_TP_Phosphorus', 'QA_TDP_Phosphorus',
       'QA_Other_Phosphorus', 'QA_Secchi', 'QA_DO', 'QA_Temperature',
       'QA_Chlorophyll', 'QA_Salinity'],
      dtype='object')
[91]:
# look at main table results (first 5)
main_df.head()
[91]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier MonitoringLocationIdentifier ProviderName ActivityStartDateTime AnalysisStartDateTime AnalysisEndDateTime Secchi Temperature ... QA_Fecal_Coliform QA_Sediment QA_TP_Phosphorus QA_TDP_Phosphorus QA_Other_Phosphorus QA_Secchi QA_DO QA_Temperature QA_Chlorophyll QA_Salinity
386985 SSCW Salem Sound Coastwatch (Volunteer) SSCW-552 (P3):20250813:1029:FM:WB:0.5 SSCW-552 (P3) STORET 2025-08-13 14:29:00+00:00 NaT NaT NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
386996 SSCW Salem Sound Coastwatch (Volunteer) SSCW-P2:20250813:1055:FM:WB:0.5 SSCW-P2 STORET 2025-08-13 14:55:00+00:00 NaT NaT NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
387616 SSCW Salem Sound Coastwatch (Volunteer) SSCW-P2:20250902:0936:FM:WB:0.5 SSCW-P2 STORET 2025-09-02 13:36:00+00:00 NaT NaT NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
474268 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-185450_2021 NARS_WQX-NWC_RI-10033 STORET NaT NaT NaT NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
347883 AQUINNAH_WQX Wampanoag Tribe of Gay Head (Aquinnah) (Tribal) AQUINNAH_WQX-P3:202208091258:FM AQUINNAH_WQX-P3 STORET 2022-08-09 17:58:00+00:00 NaT NaT NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 48 columns

[92]:
# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]
[92]:
['AnalysisStartDateTime',
 'AnalysisEndDateTime',
 'Sediment',
 'QA_Conductivity',
 'QA_Carbon',
 'QA_Sediment',
 'QA_TDP_Phosphorus',
 'QA_DO']
[93]:
# Map average results at each station
gdf_avg = visualize.map_measure(main_df, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[93]:
<Axes: >
../_images/notebooks_Harmonize_CapeCod_Detailed_144_1.png