Tampa Bay, FL - Detailed step-by-step

Standardize, clean and wrangle Water Quality Portal data in Tampa Bay, FL into more analytic-ready formats using the harmonize_wq package

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:

  • Identify differences in data units (including speciation and basis)

  • Identify differences in sampling or analytic methods

  • Resolve data errors using transparent assumptions

  • Reduce data to the columns that are most commonly needed

  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

Detailed step-by-step workflow

This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Tampa Bay, FL

Install and import the required libraries

[1]:
import sys
#!python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8
[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean

Download location data using dataretrieval

[3]:
# Read geometry for Area of Interest from geojson file url and plot
aoi_url = r'https://github.com/USEPA/Coastal_Ecological_Indicators/raw/master/DGGS_Coastal/temperature_data/TampaBay.geojson'
# geoJSON should be WGS1984 standard, but this one isn't
aoi_gdf = wrangle.as_gdf(aoi_url).to_crs(epsg=4326)
aoi_gdf.plot()
[3]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_9_1.png
[4]:
# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
                                'Temperature, water',
                                'Depth, Secchi disk depth',
                                'Dissolved oxygen (DO)',
                                'Salinity',
                                'pH',
                                'Nitrogen',
                                'Conductivity',
                                'Organic carbon',
                                'Chlorophyll a',
                                'Turbidity',
                                'Sediment',
                                'Fecal Coliform',
                                'Escherichia coli']}
query['bBox'] =wrangle.get_bounding_box(aoi_gdf)
[5]:
# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/dataretrieval/wqp.py:210: DtypeWarning: Columns (8,10,14,21,27,29,30,33,35) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(StringIO(response.text), delimiter=",")
[6]:
# Rows and columns for results
stations.shape
[6]:
(16539, 37)
[7]:
# First 5 rows
stations.head()
[7]:
OrganizationIdentifier OrganizationFormalName MonitoringLocationIdentifier MonitoringLocationName MonitoringLocationTypeName MonitoringLocationDescriptionText HUCEightDigitCode DrainageAreaMeasure/MeasureValue DrainageAreaMeasure/MeasureUnitCode ContributingDrainageAreaMeasure/MeasureValue ... AquiferName LocalAqfrName FormationTypeText AquiferTypeName ConstructionDateText WellDepthMeasure/MeasureValue WellDepthMeasure/MeasureUnitCode WellHoleDepthMeasure/MeasureValue WellHoleDepthMeasure/MeasureUnitCode ProviderName
0 USGS-FL USGS Florida Water Science Center USGS-02300009 MANATEE RIVER AT DEVILS ELBOW NEAR FT HAMER FL Estuary NaN 3100202.0 139.0 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
1 USGS-FL USGS Florida Water Science Center USGS-02300018 GAMBLE CREEK NEAR PARRISH FL Stream NaN 3100202.0 50.6 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
2 USGS-FL USGS Florida Water Science Center USGS-02300021 MANATEE RIVER AT FORT HAMER FL Estuary NaN 3100202.0 216.0 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
3 USGS-FL USGS Florida Water Science Center USGS-02300062 GLEN CREEK NEAR BRADENTON FL Stream NaN 3100202.0 2.5 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
4 USGS-FL USGS Florida Water Science Center USGS-02300064 BRADEN RIVER AT BRADENTON FL Stream NaN 3100202.0 83.0 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS

5 rows × 37 columns

[8]:
# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]
[8]:
HorizontalCoordinateReferenceSystemDatumName        NAD83
LatitudeMeasure                                 27.520872
LongitudeMeasure                                -82.40176
Name: 0, dtype: object
[9]:
# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, outEPSG=4326, intermediate_columns=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:356: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  cond_notna = mask & (df_out["QA_flag"].notna())  # Mask cond and not NA
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'LatitudeMeasure: Imprecise: lessthan3decimaldigits' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:356: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  cond_notna = mask & (df_out["QA_flag"].notna())  # Mask cond and not NA
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
[10]:
# Every function has a dostring to help understand input/output and what it does
location.harmonize_locations?
[11]:
# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape
[11]:
(16539, 42)
[12]:
# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]
[12]:
geom_orig         (-82.4017604, 27.5208719)
EPSG                                 4269.0
QA_flag                                 NaN
geom         POINT (-82.4017604 27.5208719)
geometry     POINT (-82.4017604 27.5208719)
Name: 0, dtype: object
[13]:
# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype
[13]:
<geopandas.array.GeometryDtype at 0x7f22ab398880>
[14]:
# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])
[14]:
{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
 'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed'}
[15]:
# Map it
stations_gdf.plot()
[15]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_21_1.png
[16]:
# Clip it to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)
[17]:
# Map it
stations_clipped.plot()
[17]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_23_1.png
[18]:
# How many stations now?
len(stations_clipped)
[18]:
10930
[19]:
# To save the results to a shapefile
#import os
#path = ''  #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path, 'Tampa_stations.shp'))

Retrieve Characteristic Data

[20]:
# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/dataretrieval/wqp.py:153: DtypeWarning: Columns (9,10,13,15,17,19,22,23,28,31,33,36,38,58,60,61,62,63,64,65,70,71,73) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(StringIO(response.text), delimiter=",")
[21]:
df = res_narrow
df
[21]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... AnalysisEndTime/TimeZoneCode ResultLaboratoryCommentCode ResultLaboratoryCommentText ResultDetectionQuantitationLimitUrl LaboratoryAccreditationIndicator LaboratoryAccreditationAuthorityName TaxonomistAccreditationIndicator TaxonomistAccreditationAuthorityName LabSamplePreparationUrl ProviderName
0 21FLHILL_WQX Environmental Protection Commission of Hillsbo... 21FLHILL_WQX-130612585-W 2013-06-12 11:01:00 EST 21FLHILL_WQX-585 STORET-301235413 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
1 21FLSEAS_WQX Florida Department of Environmental Protection 21FLSEAS_WQX-481901119134 2013-11-19 14:01:00 EST 21FLSEAS_WQX-48SEAS190 STORET-310535134 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
2 21FLHILL_WQX Environmental Protection Commission of Hillsbo... 21FLHILL_WQX-130702047-M 2013-07-02 11:01:00 EST 21FLHILL_WQX-047 STORET-300620295 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
3 21FLHILL_WQX Environmental Protection Commission of Hillsbo... 21FLHILL_WQX-130716021 2013-07-16 11:01:00 EST 21FLHILL_WQX-021 STORET-300666279 NaN NaN ... NaN NaN NaN https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET
4 21FLHILL_WQX Environmental Protection Commission of Hillsbo... 21FLHILL_WQX-131216112-M 2013-12-16 12:01:00 EST 21FLHILL_WQX-112 STORET-301229196 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1537974 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981152F-1981-0 1981-07-15 00:00:00 EST 21FLBSG-13 STORET-100013861879.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
1537975 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981203F-1981-0 1981-01-20 00:00:00 EST 21FLBSG-13 STORET-100014014414.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
1537976 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981200F-1981-0 1981-01-20 00:00:00 EST 21FLBSG-13 STORET-100014014426.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
1537977 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-198173F-1981-0 1981-01-07 00:00:00 EST 21FLBSG-13 STORET-100013998412.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
1537978 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981153F-1981-0 1981-07-15 00:00:00 EST 21FLBSG-13 STORET-100013857818.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET

1537979 rows × 78 columns

[22]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[22]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_29_1.png

Harmonize Characteristic Results

Two options for functions to harmonize characteristics: harmonize_all() or harmonize(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.

[23]:
# See Documentation
#harmonize.harmonize_all?
#harmonize.harmonize?

secchi disk depth

[24]:
# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)

# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, 'Depth, Secchi disk depth', intermediate_columns=True, report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/clean.py:360: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'ResultMeasureValue: "Not Reported" result cannot be used' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask & (df_out["QA_flag"].isna()), "QA_flag"] = flag
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(0.8, 'meter')> <Quantity(2.2, 'meter')>
 <Quantity(2.7, 'meter')> ... <Quantity(2.4, 'meter')>
 <Quantity(1.8, 'meter')> <Quantity(3.3, 'meter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    92675.000000
mean         1.472673
std          0.908834
min         -9.000000
25%          0.900000
50%          1.300000
75%          1.900000
max         32.004000
dtype: float64
Unusable results: 281
Usable results with inferred units: 1
Results outside threshold (0.0 to 6.925674654247189): 55
../_images/notebooks_Harmonize_Tampa_Detailed_34_2.png

The threshold is based on standard deviations and is currently only used in the histogram.

[25]:
# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results
[25]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
36 21FLHILL_WQX-1510 0.80 m NaN m 0.8 meter
68 21FLMANA_WQX-428 2.2 m NaN m 2.2 meter
96 21FLHILL_WQX-096 2.70 m NaN m 2.7 meter
106 21FLHILL_WQX-064 0.80 m NaN m 0.8 meter
127 21FLCOSP_WQX-COSPE6-2 1.7 m NaN m 1.7 meter
... ... ... ... ... ... ...
1537886 21FLBSG-13 3.8 m NaN m 3.8 meter
1537891 21FLBSG-13 1.8 m NaN m 1.8 meter
1537902 21FLBSG-13 2.4 m NaN m 2.4 meter
1537939 21FLBSG-13 1.8 m NaN m 1.8 meter
1537969 21FLBSG-13 3.3 m NaN m 3.3 meter

92956 rows × 6 columns

[26]:
# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]
[26]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
510468 21FLPDEM_WQX-14-02 Not Reported m ResultMeasureValue: "Not Reported" result cann... m NaN
515365 21FLKWAT_WQX-HIL-RAINBOW-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
518679 21FLKWAT_WQX-PIN-COFFEEPOBAYOU-8 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
520604 21FLKWAT_WQX-PIN-COFFEEPOBAYOU-6 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
522083 21FLPDEM_WQX-E2-D-19-02 Not Reported m ResultMeasureValue: "Not Reported" result cann... m NaN
... ... ... ... ... ... ...
1525558 USGS-275100082280500 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525611 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525614 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525900 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525980 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN

281 rows × 6 columns

[27]:
# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]
[27]:
'ResultMeasureValue: "Not Reported" result cannot be used'
[28]:
# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]
[28]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
284818 NARS_WQX-NCCA10-1674 -9 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... m -9.0 meter
510468 21FLPDEM_WQX-14-02 Not Reported m ResultMeasureValue: "Not Reported" result cann... m NaN
515365 21FLKWAT_WQX-HIL-RAINBOW-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
518679 21FLKWAT_WQX-PIN-COFFEEPOBAYOU-8 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
520604 21FLKWAT_WQX-PIN-COFFEEPOBAYOU-6 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
... ... ... ... ... ... ...
1525558 USGS-275100082280500 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525611 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525614 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525900 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN
1525980 USGS-275530082383300 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... m NaN

282 rows × 6 columns

If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added

[29]:
# Aggregate secchi data by station
visualize.station_summary(sechi_results, 'Secchi')
[29]:
MonitoringLocationIdentifier cnt mean
0 21FLBRA-1530-A 2 0.375000
1 21FLBRA-1541B-A 3 1.166667
2 21FLBRA-1574-A 1 0.250000
3 21FLBRA-1574A-A 2 0.250000
4 21FLBRA-1574A-B 1 1.250000
... ... ... ...
12165 USGS-280630082350900 3 1.966667
12166 USGS-280635082322100 2 2.100000
12167 USGS-280640082434700 3 2.302933
12168 USGS-280719082291400 2 1.000000
12169 USGS-280730082431800 3 1.947333

12170 rows × 3 columns

[30]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 2.
  self.bins = quantile(y, k=k)
[30]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_42_2.png
[31]:
# Map average results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[31]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_43_1.png

Temperature

The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])

[32]:
#'Temperature, water'
# Note: Default errors='raise'
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(28.19, 'degree_Celsius')> <Quantity(29.52, 'degree_Celsius')>
 <Quantity(21.0, 'degree_Celsius')> ... <Quantity(31.2, 'degree_Celsius')>
 <Quantity(14.1, 'degree_Celsius')> <Quantity(31.1, 'degree_Celsius')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    313146.000000
mean         25.280097
std          78.216584
min          -2.900000
25%          21.200000
50%          25.860000
75%          29.200000
max       43696.000000
dtype: float64
Unusable results: 174
Usable results with inferred units: 0
Results outside threshold (0.0 to 494.579598478139): 2
../_images/notebooks_Harmonize_Tampa_Detailed_46_2.png
[33]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results
[33]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
2 21FLHILL_WQX-047 28.19 deg C NaN 28.19 degree_Celsius degC
9 21FLTBW_WQX-M23 29.52 deg C NaN 29.52 degree_Celsius degC
14 21FLMANA_WQX-GA1 21 deg C NaN 21.0 degree_Celsius degC
34 21FLHILL_WQX-1509 27.67 deg C NaN 27.67 degree_Celsius degC
37 21FLTBW_WQX-PR103026 17.73 deg C NaN 17.73 degree_Celsius degC
... ... ... ... ... ... ...
1537970 21FLBSG-13 22.2 deg C NaN 22.2 degree_Celsius degC
1537971 21FLBSG-13 22.2 deg C NaN 22.2 degree_Celsius degC
1537972 21FLBSG-13 31.2 deg C NaN 31.2 degree_Celsius degC
1537977 21FLBSG-13 14.1 deg C NaN 14.1 degree_Celsius degC
1537978 21FLBSG-13 31.1 deg C NaN 31.1 degree_Celsius degC

313320 rows × 6 columns

In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius

[34]:
# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]
[34]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
514605 21FLPDEM_WQX-19-13 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
515694 21FLPDEM_WQX-24-07 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
517127 21FLPDEM_WQX-12-04 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
522167 21FLPDEM_WQX-23-08 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
528255 21FLPDEM_WQX-04-04 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
... ... ... ... ... ... ...
893017 21FLPDEM_WQX-35-01 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
896720 21FLPDEM_WQX-23-08 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
1405328 USGS-280228082343000 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
1519325 USGS-02306028 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
1519608 USGS-02306028 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC

87 rows × 6 columns

We can see where the units were missing, the results were assumed to be in degree_Celsius already

[35]:
# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]
[35]:
'ResultMeasureValue: "Not Reported" result cannot be used; ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'
[36]:
# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]
[36]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
514605 21FLPDEM_WQX-19-13 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
515694 21FLPDEM_WQX-24-07 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
517127 21FLPDEM_WQX-12-04 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
522167 21FLPDEM_WQX-23-08 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
528255 21FLPDEM_WQX-04-04 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... NaN degC
... ... ... ... ... ... ...
1442860 21FLPDEM_WQX-12-02 Not Reported deg C ResultMeasureValue: "Not Reported" result cann... NaN degC
1461996 21FLPDEM_WQX-24-01 NaN deg C ResultMeasureValue: missing (NaN) result NaN degC
1462509 21FLPDEM_WQX-04-04 NaN deg C ResultMeasureValue: missing (NaN) result NaN degC
1519325 USGS-02306028 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
1519608 USGS-02306028 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC

174 rows × 6 columns

[37]:
# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')
[37]:
MonitoringLocationIdentifier cnt mean
0 21FLBRA-1530-A 12 27.593333
1 21FLBRA-1530-B 7 26.290000
2 21FLBRA-1541A-A 6 26.016667
3 21FLBRA-1541B-A 6 26.743333
4 21FLBRA-1574-A 5 27.890000
... ... ... ...
15334 USGS-280726082313300 4 28.025000
15335 USGS-280728082301101 54 25.083333
15336 USGS-280729082313501 1 27.400000
15337 USGS-280730082313201 1 24.700000
15338 USGS-280730082431800 11 22.018182

15339 rows × 3 columns

[38]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
  self.bins = quantile(y, k=k)
[38]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_54_2.png
[39]:
# Map average results at each station
gdf_avg = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[39]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_55_1.png

Dissolved oxygen

[40]:
# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(9.32, 'milligram / liter')>
 <Quantity(8.08, 'milligram / liter')>
 <Quantity(4.9, 'milligram / liter')> ...
 <Quantity(7.9, 'milligram / liter')> <Quantity(5.2, 'milligram / liter')>
 <Quantity(4.9, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)

Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.

[41]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res
[41]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag DO
0 21FLHILL_WQX-585 9.32 mg/L NaN 9.32 milligram / liter
10 21FLHILL_WQX-1606 8.08 mg/L NaN 8.08 milligram / liter
12 21FLHILL_WQX-1611 4.90 mg/L NaN 4.9 milligram / liter
16 21FLHILL_WQX-1606 2.56 mg/L NaN 2.56 milligram / liter
19 21FLPDEM_WQX-24-01 68.7 % NaN 0.05676222371166 milligram / liter
... ... ... ... ... ...
1536237 21FLSEAS-48SEAS140 8.1 mg/l NaN 8.1 milligram / liter
1536239 21FLSEAS-48SEAS070 6.0 mg/l NaN 6.0 milligram / liter
1536249 21FLSEAS-48SEAS070 7.9 mg/l NaN 7.9 milligram / liter
1536255 21FLSEAS-42SEAS731 5.2 mg/l NaN 5.2 milligram / liter
1536270 21FLSEAS-48SEAS040 4.9 mg/l NaN 4.9 milligram / liter

282694 rows × 5 columns

[42]:
do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']
[42]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag DO
0 21FLHILL_WQX-585 9.32 mg/L NaN 9.32 milligram / liter
10 21FLHILL_WQX-1606 8.08 mg/L NaN 8.08 milligram / liter
12 21FLHILL_WQX-1611 4.90 mg/L NaN 4.9 milligram / liter
16 21FLHILL_WQX-1606 2.56 mg/L NaN 2.56 milligram / liter
19 21FLPDEM_WQX-24-01 68.7 % NaN 0.05676222371166 milligram / liter
... ... ... ... ... ...
1507380 21FLHILL_WQX-36 7.66 mg/L NaN 7.66 milligram / liter
1507382 21FLHILL_WQX-1608 5.75 mg/L NaN 5.75 milligram / liter
1507393 21FLHILL_WQX-1611 7.09 mg/L NaN 7.09 milligram / liter
1507396 21FLHILL_WQX-265 5.41 mg/L NaN 5.41 milligram / liter
1507397 21FLHILL_WQX-137 9.17 mg/L NaN 9.17 milligram / liter

185705 rows × 5 columns

Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())

[43]:
# Aggregate data by station
visualize.station_summary(do_res, 'DO')
[43]:
MonitoringLocationIdentifier cnt mean
0 21FLBRA-1530-A 12 2.785000
1 21FLBRA-1530-B 7 4.042857
2 21FLBRA-1541A-A 6 4.721667
3 21FLBRA-1541B-A 6 6.600000
4 21FLBRA-1574-A 5 4.378000
... ... ... ...
13907 NARS_WQX-NCCA10-1672 8 5.437500
13908 NARS_WQX-NCCA10-1673 20 4.115000
13909 NARS_WQX-NCCA10-1674 6 2.466667
13910 NARS_WQX-NLA06608-0161 5 6.500000
13911 NARS_WQX-NLA_FL-10127 3 9.033333

13912 rows × 3 columns

[44]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
  self.bins = quantile(y, k=k)
[44]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_63_2.png
[45]:
# Map average results at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[45]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_64_1.png

pH

[46]:
# pH, this time looking at a report
df = harmonize.harmonize(df, 'pH', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(7.48, 'dimensionless')> <Quantity(8.18, 'dimensionless')>
 <Quantity(7.81, 'dimensionless')> ... <Quantity(7.8, 'dimensionless')>
 <Quantity(8.1, 'dimensionless')> <Quantity(8.3, 'dimensionless')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    289471.000000
mean          7.759410
std           0.472718
min           0.370000
25%           7.510000
50%           7.890000
75%           8.070000
max          12.970000
dtype: float64
Unusable results: 194
Usable results with inferred units: 270952
Results outside threshold (0.0 to 10.595716827907024): 7
../_images/notebooks_Harmonize_Tampa_Detailed_66_2.png

Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.

[47]:
df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]
[47]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag pH
4 7.48 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 7.48 dimensionless
5 8.18 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 8.18 dimensionless
7 7.81 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 7.81 dimensionless
11 7.96 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 7.96 dimensionless
17 7.92 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 7.92 dimensionless
... ... ... ... ...
1536246 8.2 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 8.2 dimensionless
1536262 8.3 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 8.3 dimensionless
1536263 7.8 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 7.8 dimensionless
1536264 8.1 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 8.1 dimensionless
1536267 8.3 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 8.3 dimensionless

289665 rows × 4 columns

‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless

Salinity

[48]:
# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/basis.py:343: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '@25C' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask, basis_col] = basis
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:510: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[nan nan nan ... nan nan nan]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  self.df[c_mask] = basis.update_result_basis(
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(40.0, 'Practical_Salinity_Units')>
 <Quantity(29.0, 'Practical_Salinity_Units')>
 <Quantity(26.04, 'Practical_Salinity_Units')> ...
 <Quantity(31.3, 'Practical_Salinity_Units')>
 <Quantity(29.0, 'Practical_Salinity_Units')>
 <Quantity(28.0, 'Practical_Salinity_Units')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    279633.000000
mean         21.577041
std          93.242129
min          -0.020000
25%          17.700000
50%          24.870000
75%          28.650000
max       48930.000000
dtype: float64
Unusable results: 1275
Usable results with inferred units: 0
Results outside threshold (0.0 to 581.0298164813172): 4
../_images/notebooks_Harmonize_Tampa_Detailed_71_2.png
[49]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]
[49]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity
1 40 ppth NaN 40.0 Practical_Salinity_Units
6 29 PSS NaN 29.0 Practical_Salinity_Units
8 26.04 PSS NaN 26.04 Practical_Salinity_Units
21 0.18 ppth NaN 0.18 Practical_Salinity_Units
23 5.9 ppth NaN 5.9 Practical_Salinity_Units
... ... ... ... ...
1537967 27.8 PSS NaN 27.8 Practical_Salinity_Units
1537973 28.0 PSS NaN 28.0 Practical_Salinity_Units
1537974 31.3 PSS NaN 31.3 Practical_Salinity_Units
1537975 29.0 PSS NaN 29.0 Practical_Salinity_Units
1537976 28.0 PSS NaN 28.0 Practical_Salinity_Units

280908 rows × 4 columns

Nitrogen

[50]:
# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/basis.py:343: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'as N' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[mask, basis_col] = basis
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:484: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '['as N' 'as N' 'as N' 'as N' 'as N' 'as N' 'as N' 'as N' nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan 'as N' nan nan nan 'as N' nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 'as N'
 nan 'as N' nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan 'as N' nan nan nan nan 'as N' nan nan nan nan nan nan nan nan nan
 nan nan 'as N' 'as N' nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  self.df[c_mask] = basis.basis_from_method_spec(self.df[c_mask])
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(0.39, 'milligram / liter')>
 <Quantity(0.4475, 'milligram / liter')>
 <Quantity(0.425, 'milligram / liter')>
 <Quantity(0.4625, 'milligram / liter')>
 <Quantity(0.33625, 'milligram / liter')>
 <Quantity(0.28, 'milligram / liter')>
 <Quantity(0.5625, 'milligram / liter')>
 <Quantity(0.21875, 'milligram / liter')>
 <Quantity(0.629, 'milligram / liter')>
 <Quantity(0.505, 'milligram / liter')>
 <Quantity(0.253, 'milligram / liter')>
 <Quantity(0.325, 'milligram / liter')>
 <Quantity(0.253, 'milligram / liter')>
 <Quantity(0.456, 'milligram / liter')>
 <Quantity(0.183, 'milligram / liter')>
 <Quantity(0.526, 'milligram / liter')>
 <Quantity(0.264, 'milligram / liter')>
 <Quantity(0.188, 'milligram / liter')>
 <Quantity(0.346, 'milligram / liter')>
 <Quantity(0.641, 'milligram / liter')>
 <Quantity(0.392, 'milligram / liter')>
 <Quantity(0.444, 'milligram / liter')>
 <Quantity(0.274, 'milligram / liter')>
 <Quantity(0.284, 'milligram / liter')>
 <Quantity(0.321, 'milligram / liter')>
 <Quantity(0.343, 'milligram / liter')>
 <Quantity(0.384, 'milligram / liter')>
 <Quantity(0.295, 'milligram / liter')>
 <Quantity(0.20244, 'milligram / liter')>
 <Quantity(0.42266, 'milligram / liter')>
 <Quantity(0.2191, 'milligram / liter')>
 <Quantity(0.43078, 'milligram / liter')>
 <Quantity(0.19796, 'milligram / liter')>
 <Quantity(0.95186, 'milligram / liter')>
 <Quantity(0.329, 'milligram / liter')>
 <Quantity(0.20986, 'milligram / liter')>
 <Quantity(0.31556, 'milligram / liter')>
 <Quantity(0.35686, 'milligram / liter')>
 <Quantity(0.3409, 'milligram / liter')>
 <Quantity(0.2919, 'milligram / liter')>
 <Quantity(0.60508, 'milligram / liter')>
 <Quantity(0.25802, 'milligram / liter')>
 <Quantity(0.32074, 'milligram / liter')>
 <Quantity(0.64302, 'milligram / liter')>
 <Quantity(0.6727, 'milligram / liter')>
 <Quantity(0.5376, 'milligram / liter')>
 <Quantity(0.54488, 'milligram / liter')>
 <Quantity(0.3353, 'milligram / liter')>
 <Quantity(0.165, 'milligram / liter')>
 <Quantity(0.526, 'milligram / liter')>
 <Quantity(0.308, 'milligram / liter')>
 <Quantity(0.234, 'milligram / liter')>
 <Quantity(0.301, 'milligram / liter')>
 <Quantity(0.219, 'milligram / liter')>
 <Quantity(1.4, 'milligram / liter')>
 <Quantity(0.253, 'milligram / liter')>
 <Quantity(0.238, 'milligram / liter')>
 <Quantity(0.271, 'milligram / liter')>
 <Quantity(1.59, 'milligram / liter')>
 <Quantity(0.224, 'milligram / liter')>
 <Quantity(0.225, 'milligram / liter')>
 <Quantity(0.203, 'milligram / liter')>
 <Quantity(0.463, 'milligram / liter')>
 <Quantity(0.165, 'milligram / liter')>
 <Quantity(0.36, 'milligram / liter')>
 <Quantity(0.80493, 'milligram / liter')>
 <Quantity(0.523, 'milligram / liter')>
 <Quantity(0.233, 'milligram / liter')>
 <Quantity(0.402, 'milligram / liter')>
 <Quantity(0.378, 'milligram / liter')>
 <Quantity(0.412, 'milligram / liter')>
 <Quantity(0.499, 'milligram / liter')>
 <Quantity(0.49267, 'milligram / liter')>
 <Quantity(0.181, 'milligram / liter')>
 <Quantity(0.519, 'milligram / liter')>
 <Quantity(0.141, 'milligram / liter')>
 <Quantity(0.497, 'milligram / liter')>
 <Quantity(0.546, 'milligram / liter')>
 <Quantity(0.208, 'milligram / liter')>
 <Quantity(0.55243, 'milligram / liter')>
 <Quantity(0.253, 'milligram / liter')>
 <Quantity(1.02, 'milligram / liter')>
 <Quantity(0.418, 'milligram / liter')>
 <Quantity(2.7, 'milligram / liter')>
 <Quantity(0.404, 'milligram / liter')>
 <Quantity(0.178, 'milligram / liter')>
 <Quantity(0.437, 'milligram / liter')>
 <Quantity(0.333, 'milligram / liter')>
 <Quantity(0.208, 'milligram / liter')>
 <Quantity(0.344, 'milligram / liter')>
 <Quantity(0.275, 'milligram / liter')>
 <Quantity(0.238, 'milligram / liter')>
 <Quantity(0.223, 'milligram / liter')>
 <Quantity(0.288, 'milligram / liter')>
 <Quantity(0.421, 'milligram / liter')>
 <Quantity(0.475, 'milligram / liter')>
 <Quantity(0.539, 'milligram / liter')>
 <Quantity(0.3, 'milligram / liter')>
 <Quantity(0.244, 'milligram / liter')>
 <Quantity(0.308, 'milligram / liter')>
 <Quantity(0.315, 'milligram / liter')>
 <Quantity(1.4, 'milligram / liter')>
 <Quantity(0.455, 'milligram / liter')>
 <Quantity(0.189, 'milligram / liter')>
 <Quantity(0.336, 'milligram / liter')>
 <Quantity(0.229, 'milligram / liter')>
 <Quantity(1.58, 'milligram / liter')>
 <Quantity(0.20901, 'milligram / liter')>
 <Quantity(0.68194, 'milligram / liter')>
 <Quantity(0.391, 'milligram / liter')>
 <Quantity(0.50134, 'milligram / liter')>
 <Quantity(0.205, 'milligram / liter')>
 <Quantity(0.57512, 'milligram / liter')>
 <Quantity(0.278, 'milligram / liter')>
 <Quantity(0.26, 'milligram / liter')>
 <Quantity(0.416, 'milligram / liter')>
 <Quantity(0.451, 'milligram / liter')>
 <Quantity(0.2163, 'milligram / liter')>
 <Quantity(1.68, 'milligram / liter')>
 <Quantity(1.57, 'milligram / liter')>
 <Quantity(0.183, 'milligram / liter')>
 <Quantity(0.105, 'milligram / liter')>
 <Quantity(0.191, 'milligram / liter')>
 <Quantity(0.606, 'milligram / liter')>
 <Quantity(0.073, 'milligram / liter')>
 <Quantity(0.124, 'milligram / liter')>
 <Quantity(0.063, 'milligram / liter')>
 <Quantity(0.036, 'milligram / liter')>
 <Quantity(0.144, 'milligram / liter')>
 <Quantity(0.056, 'milligram / liter')>
 <Quantity(0.031, 'milligram / liter')>
 <Quantity(0.101, 'milligram / liter')>
 <Quantity(0.094, 'milligram / liter')>
 <Quantity(0.069, 'milligram / liter')>
 <Quantity(0.024, 'milligram / liter')>
 <Quantity(0.08, 'milligram / liter')>
 <Quantity(0.084, 'milligram / liter')>
 <Quantity(0.042, 'milligram / liter')>
 <Quantity(0.056, 'milligram / liter')>
 <Quantity(0.024, 'milligram / liter')>
 <Quantity(0.061, 'milligram / liter')>
 <Quantity(0.054, 'milligram / liter')>
 <Quantity(0.029, 'milligram / liter')>
 <Quantity(0.077, 'milligram / liter')>
 <Quantity(0.156, 'milligram / liter')>
 <Quantity(0.038, 'milligram / liter')>
 <Quantity(0.066, 'milligram / liter')>
 <Quantity(17.6, 'milligram / liter')>
 <Quantity(22.5, 'milligram / liter')>
 <Quantity(15.7, 'milligram / liter')>
 <Quantity(19.7, 'milligram / liter')>
 <Quantity(21.3, 'milligram / liter')>
 <Quantity(15.7, 'milligram / liter')>
 <Quantity(19.5, 'milligram / liter')>
 <Quantity(16.7, 'milligram / liter')>
 <Quantity(18.0, 'milligram / liter')>
 <Quantity(18.0, 'milligram / liter')>
 <Quantity(15.3, 'milligram / liter')>
 <Quantity(0.084, 'milligram / liter')>
 <Quantity(0.166, 'milligram / liter')>
 <Quantity(0.091, 'milligram / liter')>
 <Quantity(0.057, 'milligram / liter')>
 <Quantity(0.03, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/domains.py:277: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  sub_df[cols[2]] = sub_df[cols[2]].fillna(sub_df[cols[1]])  # new_fract
-Usable results-
count    163.000000
mean       1.575389
std        4.532429
min        0.024000
25%        0.202720
50%        0.315560
75%        0.500170
max       22.500000
dtype: float64
Unusable results: 2
Usable results with inferred units: 0
Results outside threshold (0.0 to 28.769965070579055): 0
../_images/notebooks_Harmonize_Tampa_Detailed_74_2.png
[51]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]
[51]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Nitrogen
262029 0.39 mg/L NaN 0.39 milligram / liter
278162 0.4475 mg/L NaN 0.4475 milligram / liter
280955 0.425 mg/L NaN 0.425 milligram / liter
294787 0.4625 mg/L NaN 0.4625 milligram / liter
296176 0.33625 mg/L NaN 0.33625 milligram / liter
... ... ... ... ...
1533895 0.084 mg/l NaN 0.084 milligram / liter
1533903 0.166 mg/l NaN 0.166 milligram / liter
1533913 0.091 mg/l NaN 0.091 milligram / liter
1533947 0.057 mg/l NaN 0.057 milligram / liter
1534102 0.03 mg/l NaN 0.03 milligram / liter

165 rows × 4 columns

Conductivity

[52]:
# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(626.0, 'microsiemens / centimeter')>
 <Quantity(688.0, 'microsiemens / centimeter')>
 <Quantity(606.0, 'microsiemens / centimeter')>
 <Quantity(606.0, 'microsiemens / centimeter')>
 <Quantity(633.0, 'microsiemens / centimeter')>
 <Quantity(776.0, 'microsiemens / centimeter')>
 <Quantity(776.0, 'microsiemens / centimeter')>
 <Quantity(775.0, 'microsiemens / centimeter')>
 <Quantity(776.0, 'microsiemens / centimeter')>
 <Quantity(775.0, 'microsiemens / centimeter')>
 <Quantity(20500.0, 'microsiemens / centimeter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count       11.000000
mean      2503.363636
std       5969.279978
min        606.000000
25%        629.500000
50%        775.000000
75%        776.000000
max      20500.000000
dtype: float64
Unusable results: 8
Usable results with inferred units: 0
Results outside threshold (0.0 to 38319.04350375742): 0
../_images/notebooks_Harmonize_Tampa_Detailed_77_2.png
[53]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]
[53]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Conductivity
326137 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
330828 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
332265 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
334703 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
335567 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
337350 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
342039 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
358182 NaN uS/cm ResultMeasureValue: missing (NaN) result NaN
369692 626 uS/cm NaN 626.0 microsiemens / centimeter
374058 688 uS/cm NaN 688.0 microsiemens / centimeter
391451 606 uS/cm NaN 606.0 microsiemens / centimeter
392557 606 uS/cm NaN 606.0 microsiemens / centimeter
395780 633 uS/cm NaN 633.0 microsiemens / centimeter
658677 776 uS/cm NaN 776.0 microsiemens / centimeter
661403 776 uS/cm NaN 776.0 microsiemens / centimeter
662328 775 uS/cm NaN 775.0 microsiemens / centimeter
664393 776 uS/cm NaN 776.0 microsiemens / centimeter
667560 775 uS/cm NaN 775.0 microsiemens / centimeter
874932 20500 uS/cm NaN 20500.0 microsiemens / centimeter

Chlorophyll a

[54]:
# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(0.00594, 'milligram / liter')>
 <Quantity(0.00145, 'milligram / liter')>
 <Quantity(0.00277, 'milligram / liter')> ...
 <Quantity(0.00741, 'milligram / liter')>
 <Quantity(0.00662, 'milligram / liter')>
 <Quantity(0.01812, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    43334.000000
mean         0.014368
std          0.022741
min         -0.000506
25%          0.004600
50%          0.008725
75%          0.016360
max          1.552000
dtype: float64
Unusable results: 1115
Usable results with inferred units: 4
Results outside threshold (0.0 to 0.1508140878070869): 197
../_images/notebooks_Harmonize_Tampa_Detailed_80_2.png
[55]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]
[55]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Chlorophyll
261544 5.94 ug/L NaN 0.005940000000000001 milligram / liter
266516 1.45 ug/L NaN 0.00145 milligram / liter
267451 2.77 ug/L NaN 0.00277 milligram / liter
268302 3.87 ug/L NaN 0.00387 milligram / liter
276393 8.15 ug/L NaN 0.008150000000000001 milligram / liter
... ... ... ... ...
1537962 5.42 ug/l NaN 0.00542 milligram / liter
1537963 15.91 ug/l NaN 0.01591 milligram / liter
1537964 7.41 ug/l NaN 0.00741 milligram / liter
1537965 6.62 ug/l NaN 0.00662 milligram / liter
1537966 18.12 ug/l NaN 0.01812 milligram / liter

44449 rows × 4 columns

Organic Carbon

[56]:
# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(3.8, 'milligram / liter')>
 <Quantity(19.5, 'milligram / liter')>
 <Quantity(11.3, 'milligram / liter')> ...
 <Quantity(4.8, 'milligram / liter')>
 <Quantity(12.9, 'milligram / liter')>
 <Quantity(3.7, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    2.407400e+04
mean     2.206620e+04
std      1.803432e+06
min      0.000000e+00
25%      4.540000e+00
50%      7.000000e+00
75%      1.200000e+01
max      2.000000e+08
dtype: float64
Unusable results: 1956
Usable results with inferred units: 0
Results outside threshold (0.0 to 10842655.304856202): 8
../_images/notebooks_Harmonize_Tampa_Detailed_83_2.png
[57]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]
[57]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Carbon
3 3.8 mg/L NaN 3.8 milligram / liter
124 19.5 mg/L NaN 19.5 milligram / liter
125 11.3 mg/L NaN 11.3 milligram / liter
129 6.7 mg/L NaN 6.7 milligram / liter
135 28.0 mg/L NaN 28.0 milligram / liter
... ... ... ... ...
1532129 4.53 mg/l NaN 4.53 milligram / liter
1532134 1.56 mg/l NaN 1.56 milligram / liter
1532139 4.8 mg/l NaN 4.8 milligram / liter
1532144 12.9 mg/l NaN 12.9 milligram / liter
1532149 3.7 mg/l NaN 3.7 milligram / liter

26030 rows × 4 columns

Turbidity (NTU)

[58]:
# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(4.49, 'Nephelometric_Turbidity_Units')>
 <Quantity(1.2, 'Nephelometric_Turbidity_Units')>
 <Quantity(2.9, 'Nephelometric_Turbidity_Units')> ...
 <Quantity(3.8, 'Nephelometric_Turbidity_Units')>
 <Quantity(1.0, 'Nephelometric_Turbidity_Units')>
 <Quantity(2.8, 'Nephelometric_Turbidity_Units')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count     96756.000000
mean         15.624784
std         851.847429
min          -0.047700
25%           1.500000
50%           2.500000
75%           4.100000
max      200000.000000
dtype: float64
Unusable results: 1149
Usable results with inferred units: 0
Results outside threshold (0.0 to 5126.709355321208): 157
../_images/notebooks_Harmonize_Tampa_Detailed_86_2.png
[59]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]
[59]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Turbidity
25 4.49 NTU NaN 4.49 Nephelometric_Turbidity_Units
81 1.2 NTU NaN 1.2 Nephelometric_Turbidity_Units
108 2.9 NTU NaN 2.9 Nephelometric_Turbidity_Units
147 6.9 NTU NaN 6.9 Nephelometric_Turbidity_Units
222 4.3 NTU NaN 4.3 Nephelometric_Turbidity_Units
... ... ... ... ...
1536254 1.6 NTU NaN 1.6 Nephelometric_Turbidity_Units
1536256 1.7 NTU NaN 1.7 Nephelometric_Turbidity_Units
1536257 3.8 NTU NaN 3.8 Nephelometric_Turbidity_Units
1536258 1.0 NTU NaN 1.0 Nephelometric_Turbidity_Units
1536259 2.8 NTU NaN 2.8 Nephelometric_Turbidity_Units

97905 rows × 4 columns

Sediment

[60]:
# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False)
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
[61]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]
[61]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Sediment

Phosphorus

Note: must be merged w/ activities (package runs query by site if not already merged)

[62]:
# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[<Quantity(0.049, 'milligram / liter')>
 <Quantity(0.004, 'milligram / liter')>
 <Quantity(0.049, 'milligram / liter')> ...
 <Quantity(0.04, 'milligram / liter')>
 <Quantity(0.05, 'milligram / liter')>
 <Quantity(0.04, 'milligram / liter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"

Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.

[63]:
# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]
[63]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
3156 0.049 mg/L NaN NaN
7402 0.004 mg/L NaN NaN
9350 0.049 mg/L NaN NaN
11185 0.036 mg/L NaN NaN
19612 0.050 mg/L NaN 0.05 milligram / liter
... ... ... ... ...
1534192 0.065 mg/l as P NaN NaN
1534197 0.027 mg/l as P NaN NaN
1534203 0.04 mg/l as P NaN NaN
1534218 0.05 mg/l as P NaN NaN
1534235 0.04 mg/l as P NaN NaN

35664 rows × 4 columns

[64]:
# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]
[64]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
3156 0.049 mg/L NaN NaN
7402 0.004 mg/L NaN NaN
9350 0.049 mg/L NaN NaN
11185 0.036 mg/L NaN NaN
23731 0.004 mg/L NaN NaN
... ... ... ... ...
1534192 0.065 mg/l as P NaN NaN
1534197 0.027 mg/l as P NaN NaN
1534203 0.04 mg/l as P NaN NaN
1534218 0.05 mg/l as P NaN NaN
1534235 0.04 mg/l as P NaN NaN

33701 rows × 4 columns

[65]:
# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]
[65]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
19612 0.050 mg/L NaN 0.05 milligram / liter
29925 0.009 mg/L NaN 0.009 milligram / liter
56000 0.003 mg/L NaN 0.003 milligram / liter
65845 0.050 mg/L NaN 0.05 milligram / liter
70783 0.002 mg/L NaN 0.002 milligram / liter
... ... ... ... ...
1528367 0.35 mg/l as P NaN 0.35 milligram / liter
1528370 0.2 mg/l as P NaN 0.2 milligram / liter
1528375 0.22 mg/l as P NaN 0.22 milligram / liter
1528398 0.18 mg/l as P NaN 0.18 milligram / liter
1528403 0.33 mg/l as P NaN 0.33 milligram / liter

1099 rows × 4 columns

[66]:
# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]
[66]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
262349 0.13118375 mg/L NaN NaN
279116 0.1696225 mg/L NaN NaN
279554 0.0835825 mg/L NaN NaN
285845 0.16950375 mg/L NaN NaN
295477 0.03524375 mg/L NaN NaN
... ... ... ... ...
1523548 420.0 mg/kg as P NaN NaN
1523888 0.38 % NaN NaN
1523897 330.0 mg/kg as P NaN NaN
1531511 460.0 mg/kg NaN NaN
1531513 5400.0 mg/kg NaN NaN

864 rows × 4 columns

Bacteria

Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.

Fecal Coliform

[67]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[nan nan nan ... <Quantity(8.0, 'Colony_Forming_Units / milliliter')>
 <Quantity(5.0, 'Colony_Forming_Units / milliliter')>
 <Quantity(13.0, 'Colony_Forming_Units / milliliter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count    8.647000e+03
mean     4.903257e+03
std      1.318438e+05
min      0.000000e+00
25%      3.000000e+00
50%      1.100000e+01
75%      6.000000e+01
max      1.000000e+07
dtype: float64
Unusable results: 57146
Usable results with inferred units: 5
Results outside threshold (0.0 to 795966.1242988216): 8
../_images/notebooks_Harmonize_Tampa_Detailed_102_2.png
[68]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]
[68]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Fecal_Coliform
13 760 cfu/100mL NaN NaN
15 2900 cfu/100mL NaN NaN
55 300 #/100mL NaN NaN
72 280 #/100mL NaN NaN
109 52 cfu/100mL NaN NaN
... ... ... ... ...
1536243 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN
1536251 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN
1536253 8.0 MPN/100ml NaN 8.0 Colony_Forming_Units / milliliter
1536260 5.0 MPN/100ml NaN 5.0 Colony_Forming_Units / milliliter
1536261 13.0 MPN/100ml NaN 13.0 Colony_Forming_Units / milliliter

65793 rows × 4 columns

Excherichia Coli

[69]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:158: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  for bad_meas in pandas.unique(bad_measures):
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wq_data.py:663: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[nan nan nan ... <Quantity(110.0, 'Colony_Forming_Units / milliliter')>
 <Quantity(32.0, 'Colony_Forming_Units / milliliter')>
 <Quantity(20.0, 'Colony_Forming_Units / milliliter')>]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_out.loc[m_mask, self.out_col] = convert_unit_series(**params)
-Usable results-
count      142.000000
mean       976.669014
std       4473.446618
min          0.000000
25%         21.000000
50%         46.000000
75%        120.000000
max      41000.000000
dtype: float64
Unusable results: 7603
Usable results with inferred units: 0
Results outside threshold (0.0 to 27817.348725062726): 1
../_images/notebooks_Harmonize_Tampa_Detailed_105_2.png
[70]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]
[70]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag E_coli
218670 210 MPN/100mL NaN NaN
220218 4800 MPN/100mL NaN NaN
220471 74.5 MPN/100mL NaN NaN
220737 553.9 MPN/100mL NaN NaN
221299 87 MPN/100mL NaN NaN
... ... ... ... ...
1517006 200.0 cfu/100ml NaN 200.0 Colony_Forming_Units / milliliter
1517111 4.0 cfu/100ml NaN 4.0 Colony_Forming_Units / milliliter
1517148 110.0 cfu/100ml NaN 110.0 Colony_Forming_Units / milliliter
1517173 32.0 cfu/100ml NaN 32.0 Colony_Forming_Units / milliliter
1517178 20.0 cfu/100ml NaN 20.0 Colony_Forming_Units / milliliter

7745 rows × 4 columns

Combining Salinity and Conductivity

Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).

Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.

Explore Salinity results:

[71]:
from harmonize_wq import convert
[72]:
# Salinity summary statistics
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))
Range: -0.02 to 48930.0
Results: 279633
Mean: 21.577040689455053 PSU
[73]:
# Identify extreme outliers
[x for x in lst if x >3200]
[73]:
[48930.0]

Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration

[74]:
# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
[75]:
# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]
[75]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity
231008 48930 ppth NaN 48930.0 Practical_Salinity_Units
387447 54.8 ppth NaN 54.8 Practical_Salinity_Units
414491 76.57 ppth NaN 76.57 Practical_Salinity_Units
702323 2976 ppth NaN 2976.0 Practical_Salinity_Units
723784 68.0 ppth NaN 68.0 Practical_Salinity_Units

Detection limits may help understand what caused low values and what results might need to be dropped or updated

[76]:
df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
       'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode']
/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/harmonize_wq/wrangle.py:501: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  detection_df = pandas.concat(detection_list).drop_duplicates()
[77]:
# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]
[77]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity ResultDetectionConditionText DetectionQuantitationLimitTypeName DetectionQuantitationLimitMeasure/MeasureValue DetectionQuantitationLimitMeasure/MeasureUnitCode
25645 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN Lower Quantitation Limit 5.0 ppth
25646 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN Method Detection Level 1.0 ppth
46162 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN Lower Quantitation Limit 5.0 ppth
46163 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN Method Detection Level 1.0 ppth
508042 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
584863 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
605980 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
795838 -0.02 ppth NaN -0.02 Practical_Salinity_Units NaN NaN NaN NaN
1055340 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1058171 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1059835 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1061155 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1062274 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1063051 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1063550 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1065348 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1066524 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1068610 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1099712 -0.01 ppth NaN -0.01 Practical_Salinity_Units NaN NaN NaN NaN
1227833 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229167 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229176 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229178 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229179 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229183 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229184 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1229778 0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1246407 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1246418 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1247628 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1247698 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1247757 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1247906 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248248 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248257 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248258 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248293 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248298 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248300 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248317 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248326 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1248330 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1254411 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1272272 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
1507701 0 PSS NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN

Explore Conductivity results:

[78]:
# Create series and inspect Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series
[78]:
371418      626.0 microsiemens / centimeter
375809      688.0 microsiemens / centimeter
393319      606.0 microsiemens / centimeter
394441      606.0 microsiemens / centimeter
397689      633.0 microsiemens / centimeter
661357      776.0 microsiemens / centimeter
664112      776.0 microsiemens / centimeter
665044      775.0 microsiemens / centimeter
667134      776.0 microsiemens / centimeter
670341      775.0 microsiemens / centimeter
878300    20500.0 microsiemens / centimeter
Name: Conductivity, dtype: object

Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters

[79]:
# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]
[79]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity Conductivity
878300 20500 uS/cm NaN NaN 20500.0 microsiemens / centimeter
661357 776 uS/cm NaN NaN 776.0 microsiemens / centimeter
664112 776 uS/cm NaN NaN 776.0 microsiemens / centimeter
667134 776 uS/cm NaN NaN 776.0 microsiemens / centimeter
665044 775 uS/cm NaN NaN 775.0 microsiemens / centimeter
670341 775 uS/cm NaN NaN 775.0 microsiemens / centimeter
375809 688 uS/cm NaN NaN 688.0 microsiemens / centimeter
397689 633 uS/cm NaN NaN 633.0 microsiemens / centimeter
371418 626 uS/cm NaN NaN 626.0 microsiemens / centimeter
393319 606 uS/cm NaN NaN 606.0 microsiemens / centimeter
394441 606 uS/cm NaN NaN 606.0 microsiemens / centimeter
[80]:
# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str)  # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']
[80]:
878300    12.242 dimensionless
661357     0.379 dimensionless
664112     0.379 dimensionless
667134     0.379 dimensionless
665044     0.379 dimensionless
670341     0.379 dimensionless
375809     0.335 dimensionless
397689     0.308 dimensionless
371418     0.304 dimensionless
393319     0.294 dimensionless
394441     0.294 dimensionless
Name: Salinity, dtype: object

Datetime

datetime() formats time using dataretrieval and ActivityStart

[81]:
# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]
[81]:
ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode
878300 2021-09-17 NaN NaN
661357 2007-08-15 NaN NaN
664112 2007-08-15 NaN NaN
667134 2007-08-15 NaN NaN
665044 2007-08-15 NaN NaN
... ... ... ...
1541618 1981-07-15 00:00:00 EST
1541619 1981-01-20 00:00:00 EST
1541620 1981-01-20 00:00:00 EST
1541621 1981-01-07 00:00:00 EST
1541622 1981-07-15 00:00:00 EST

1541623 rows × 3 columns

[82]:
# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]
[82]:
ActivityStartDate Activity_datetime
878300 2021-09-17 NaT
661357 2007-08-15 NaT
664112 2007-08-15 NaT
667134 2007-08-15 NaT
665044 2007-08-15 NaT
... ... ...
1541618 1981-07-15 1981-07-15 05:00:00+00:00
1541619 1981-01-20 1981-01-20 05:00:00+00:00
1541620 1981-01-20 1981-01-20 05:00:00+00:00
1541621 1981-01-07 1981-01-07 05:00:00+00:00
1541622 1981-07-15 1981-07-15 05:00:00+00:00

1541623 rows × 2 columns

Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.

Depth

Note: Data are often lacking sample depth metadata

[83]:
# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()
[83]:
375809     0.95
397689     0.50
371418     0.00
318451     0.10
319046     0.10
           ...
1385246    0.33
1458509    0.30
1462483    0.33
1464309    0.33
1464383    0.30
Name: ResultDepthHeightMeasure/MeasureValue, Length: 505, dtype: float64

Characteristic to Column (long to wide format)

[84]:
# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2
[84]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... QA_E_coli QA_Turbidity QA_Conductivity QA_TP_Phosphorus QA_TDP_Phosphorus QA_Other_Phosphorus QA_Nitrogen QA_Temperature QA_Secchi QA_Salinity
878300 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-192970_2021 2021-09-17 NaN NaN NARS_WQX-NWC_FL-10535 STORET-1040690254 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
661357 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:3.3 2007-08-15 NaN NaN NARS_WQX-NLA06608-0161 STORET-1055145219 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
664112 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:3 2007-08-15 NaN NaN NARS_WQX-NLA06608-0161 STORET-1055145215 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
667134 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:2 2007-08-15 NaN NaN NARS_WQX-NLA06608-0161 STORET-1055145209 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
665044 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:1 2007-08-15 NaN NaN NARS_WQX-NLA06608-0161 STORET-1055145207 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1541618 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981152F-1981-0 1981-07-15 00:00:00 EST 21FLBSG-13 STORET-100013861879.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1541619 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981203F-1981-0 1981-01-20 00:00:00 EST 21FLBSG-13 STORET-100014014414.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1541620 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981200F-1981-0 1981-01-20 00:00:00 EST 21FLBSG-13 STORET-100014014426.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1541621 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-198173F-1981-0 1981-01-07 00:00:00 EST 21FLBSG-13 STORET-100013998412.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1541622 21FLBSG City of Tampa Bay Study Group (Florida) 21FLBSG-1981153F-1981-0 1981-07-15 00:00:00 EST 21FLBSG-13 STORET-100013857818.0000000000 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1469564 rows × 117 columns

[85]:
# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))
14 new columns
[86]:
# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))
72059 fewer rows
[87]:
#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]
[87]:
ResultMeasureValue ResultMeasure/MeasureUnitCode Carbon QA_Carbon

Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)

[88]:
# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)
[89]:
# Columns still in main table
main_df.columns
[89]:
Index(['OrganizationIdentifier', 'OrganizationFormalName',
       'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
       'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
       'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
       'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
       'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
       'E_coli', 'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode',
       'Activity_datetime', 'Depth', 'QA_pH', 'QA_Carbon', 'QA_DO',
       'QA_Chlorophyll', 'QA_Fecal_Coliform', 'QA_E_coli', 'QA_Turbidity',
       'QA_Conductivity', 'QA_TP_Phosphorus', 'QA_TDP_Phosphorus',
       'QA_Other_Phosphorus', 'QA_Nitrogen', 'QA_Temperature', 'QA_Secchi',
       'QA_Salinity'],
      dtype='object')
[90]:
# look at main table results (first 5)
main_df.head()
[90]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier MonitoringLocationIdentifier ProviderName Secchi Temperature DO pH Salinity ... QA_E_coli QA_Turbidity QA_Conductivity QA_TP_Phosphorus QA_TDP_Phosphorus QA_Other_Phosphorus QA_Nitrogen QA_Temperature QA_Secchi QA_Salinity
878300 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-192970_2021 NARS_WQX-NWC_FL-10535 STORET NaN NaN NaN NaN 12.242 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
661357 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:3.3 NARS_WQX-NLA06608-0161 STORET NaN NaN NaN NaN 0.379 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
664112 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:3 NARS_WQX-NLA06608-0161 STORET NaN NaN NaN NaN 0.379 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
667134 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:2 NARS_WQX-NLA06608-0161 STORET NaN NaN NaN NaN 0.379 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
665044 NARS_WQX EPA National Aquatic Resources Survey (NARS) NARS_WQX-PRF:0161:1:070815:1 NARS_WQX-NLA06608-0161 STORET NaN NaN NaN NaN 0.379 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 44 columns

[91]:
# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]
[91]:
['Sediment',
 'QA_Carbon',
 'QA_DO',
 'QA_E_coli',
 'QA_Turbidity',
 'QA_Conductivity',
 'QA_TDP_Phosphorus',
 'QA_Other_Phosphorus',
 'QA_Nitrogen',
 'QA_Temperature',
 'QA_Salinity']
[92]:
# Map average temperature at each station
results_gdf = visualize.map_measure(main_df, stations_clipped, 'Temperature')
results_gdf.plot(column='mean', cmap='OrRd', legend=True)
[92]:
<Axes: >
../_images/notebooks_Harmonize_Tampa_Detailed_143_1.png