Cape Cod - Detailed step-by-step

Standardize, clean and wrangle Water Quality Portal data in Cape Cod into more analytic-ready formats using the harmonize_wq package

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:

Identify differences in data units (including speciation and basis)
Identify differences in sampling or analytic methods
Resolve data errors using transparent assumptions
Reduce data to the columns that are most commonly needed
Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

Detailed step-by-step workflow

This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Cape Cod

Install and import the required libraries

[1]:

import sys
#! python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git

[2]:

import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
  warnings.warn('GDAL not installed. Some functions will not work.')

Download location data using dataretrieval

[3]:

# Read geometry for Area of Interest from geojson file url and plot
aoi_url = 'https://github.com/jbousquin/test_notebook/raw/main/temperature_data/NewEngland.geojson'
aoi_gdf = wrangle.as_gdf(aoi_url)  # Already 4326 standard
aoi_gdf.plot()

[3]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_9_1.png

[4]:

# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
                                'Temperature, water',
                                'Depth, Secchi disk depth',
                                'Dissolved oxygen (DO)',
                                'Salinity',
                                'pH',
                                'Nitrogen',
                                'Conductivity',
                                'Organic carbon',
                                'Chlorophyll a',
                                'Turbidity',
                                'Sediment',
                                'Fecal Coliform',
                                'Escherichia coli']}
query['bBox'] = wrangle.get_bounding_box(aoi_gdf)

[5]:

# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)

[6]:

# Rows and columns for results
stations.shape

[6]:

(10630, 37)

[7]:

# First 5 rows
stations.head()

[7]:

	OrganizationIdentifier	OrganizationFormalName	MonitoringLocationIdentifier	MonitoringLocationName	MonitoringLocationTypeName	MonitoringLocationDescriptionText	HUCEightDigitCode	DrainageAreaMeasure/MeasureValue	DrainageAreaMeasure/MeasureUnitCode	ContributingDrainageAreaMeasure/MeasureValue	...	AquiferName	LocalAqfrName	FormationTypeText	AquiferTypeName	ConstructionDateText	WellDepthMeasure/MeasureValue	WellDepthMeasure/MeasureUnitCode	WellHoleDepthMeasure/MeasureValue	WellHoleDepthMeasure/MeasureUnitCode	ProviderName
0	USGS-MA	USGS Massachusetts Water Science Center	USGS-010965305	MERRIMACK R NR TYNGSBOROUGH BRIDGE TYNGSBOROUG...	Stream	NaN	1070006.0	4070.00	sq mi	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
1	USGS-MA	USGS Massachusetts Water Science Center	USGS-01096544	STONY BROOK AT SCHOOL STREET AT CHELMSFORD, MA	Stream	NaN	1070006.0	41.57	sq mi	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
2	USGS-MA	USGS Massachusetts Water Science Center	USGS-01096546	STONY BROOK AT CHELMSFORD, MA	Stream	NaN	1070006.0	43.60	sq mi	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
3	USGS-MA	USGS Massachusetts Water Science Center	USGS-01096548	STONY BROOK AT N CHELMSFORD, MA	Stream	NaN	1070006.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
4	USGS-MA	USGS Massachusetts Water Science Center	USGS-01096550	MERRIMACK RIVER ABOVE LOWELL, MA	Stream	NaN	1070006.0	3900.00	sq mi	3900.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS

5 rows × 37 columns

[8]:

# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]

[8]:

HorizontalCoordinateReferenceSystemDatumName        NAD83
LatitudeMeasure                                 42.677389
LongitudeMeasure                               -71.421056
Name: 0, dtype: object

[9]:

# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, outEPSG=4326, intermediate_columns=True)

[10]:

# Every function has a dostring to help understand input/output and what it does
location.harmonize_locations?

[11]:

# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape

[11]:

(10630, 42)

[12]:

# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]

[12]:

geom_orig         (-71.4210556, 42.67738889)
EPSG                                  4269.0
QA_flag                                  NaN
geom         POINT (-71.4210556 42.67738889)
geometry     POINT (-71.4210556 42.67738889)
Name: 0, dtype: object

[13]:

# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype

[13]:

<geopandas.array.GeometryDtype at 0x7f71e61e3490>

[14]:

# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])

[14]:

{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
 'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits'}

[15]:

# Map it
stations_gdf.plot()

[15]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_21_1.png

[16]:

# Clip to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)

[17]:

# Map it
stations_clipped.plot()

[17]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_23_1.png

[18]:

# How many stations now?
len(stations_clipped)

[18]:

[19]:

# To save the results to a shapefile
#import os
#path = ''  #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path,'CapeCod_stations.shp'))

Retrieve Characteristic Data

[20]:

# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (8,10,13,15,17,19,20,21,22,23,28,31,32,33,34,36,38,60,63,64,65,66,67,68,69,70,71,72) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(StringIO(response.text), delimiter=',')

[21]:

df = res_narrow
df

[21]:

	OrganizationIdentifier	OrganizationFormalName	ActivityIdentifier	ActivityStartDate	ActivityStartTime/Time	ActivityStartTime/TimeZoneCode	MonitoringLocationIdentifier	ResultIdentifier	DataLoggerLine	ResultDetectionConditionText	...	AnalysisEndTime/TimeZoneCode	ResultLaboratoryCommentCode	ResultLaboratoryCommentText	ResultDetectionQuantitationLimitUrl	LaboratoryAccreditationIndicator	LaboratoryAccreditationAuthorityName	TaxonomistAccreditationIndicator	TaxonomistAccreditationAuthorityName	LabSamplePreparationUrl	ProviderName
0	BRC	Blackstone River Coalition (Volunteer)	BRC-C-02-02-020:20131012040800:FM:0.2083333333...	2013-10-12	04:08:00	EST	BRC-C-02-02-020	STORET-1039097035	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
1	OARS	OARS - For the Assabet Sudbury & Concord River...	OARS-CND-161:20130818:0651:FM:1	2013-08-18	06:51:00	EDT	OARS-CND-161	STORET-838568413	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
2	OARS	OARS - For the Assabet Sudbury & Concord River...	OARS-SUD-064:20130721:0610:FM:1	2013-07-21	06:10:00	EDT	OARS-SUD-064	STORET-838568309	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
3	CRWA	Charles River Watershed Association (Massachus...	CRWA-CYN20130809ROBTemp01	2013-08-09	11:14:33	EST	CRWA-ROB	STORET-591631481	130809111433.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
4	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-m41:20131017:FM:B	2013-10-17	NaN	NaN	WTGHA-M41	STORET-1041401688	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
457247	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400495	2024-01-08	12:00:00	EST	USGS-01098530	NWIS-126836828	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
457248	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	EST	USGS-01097050	NWIS-126989046	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
457249	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	EST	USGS-01097050	NWIS-126989056	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
457250	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	EST	USGS-01097050	NWIS-126989063	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
457251	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	EST	USGS-01097050	NWIS-126989064	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS

457252 rows × 78 columns

[22]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

[22]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_29_1.png

Harmonize Characteristic Results

Two options for functions to harmonize characteristics: harmonize_all() or harmonize_generic(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize_generic runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.

[23]:

# See Documentation
#harmonize_WQP.harmonize_all?
#harmonize_WQP.harmonize?

secchi disk depth

[24]:

# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)

# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, "Depth, Secchi disk depth", errors="ignore", intermediate_columns=True, report=True)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'None' UNDEFINED UNIT for Secchi
  warn("WARNING: " + problem)

-Usable results-
count    4865.000000
mean      -12.374142
std       405.355648
min     -9999.000000
25%         1.600000
50%         3.500000
75%         6.000000
max        17.000000
dtype: float64
Unusable results: 264
Usable results with inferred units: 0
Results outside threshold (0.0 to 2419.7597455697323): 16

../_images/notebooks_Harmonize_CapeCod_Detailed_34_2.png

The threshold is based on standard deviations and is currently only used in the histogram.

[25]:

# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results

[25]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Units	Secchi
48	11113300-GRTKINSD	3.25	m	NaN	m	3.25 meter
173	11113300-BEADERD	4.25	m	NaN	m	4.25 meter
302	11113300-ANGSDND	3.5	m	NaN	m	3.5 meter
668	11113300-GRTKINSD	2.25	m	NaN	m	2.25 meter
707	11113300-GRTKINSD	4.75	m	NaN	m	4.75 meter
...	...	...	...	...	...	...
456562	11NPSWRD_WQX-CACO_DUCK_W	16.5	m	NaN	m	16.5 meter
456563	11NPSWRD_WQX-CACO_DUCK_W	10.0	m	NaN	m	10.0 meter
456564	11NPSWRD_WQX-CACO_DUCK_W	17.0	m	NaN	m	17.0 meter
456565	11NPSWRD_WQX-CACO_DUCK_W	16.0	m	NaN	m	16.0 meter
456566	11NPSWRD_WQX-CACO_GREAT_W	7.8	m	NaN	m	7.8 meter

5129 rows × 6 columns

[26]:

# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]

[26]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Units	Secchi
12509	NARS_WQX-NLA12_MA-102	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	m	NaN
35869	EPA_OCMA_R1-SS-2	dark	m	ResultMeasureValue: "dark" result cannot be used	m	NaN
38787	EPA_OCMA_R1-SS-3	dark	m	ResultMeasureValue: "dark" result cannot be used	m	NaN
40895	EPA_OCMA_R1-R1-23	dark	m	ResultMeasureValue: "dark" result cannot be used	m	NaN
43072	EPA_OCMA_R1-R1-25	dark	m	ResultMeasureValue: "dark" result cannot be used	m	NaN
...	...	...	...	...	...	...
368189	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
368293	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
368649	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
369416	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
456553	11NPSWRD_WQX-CACO_SLOUGH	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN

264 rows × 6 columns

[27]:

# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]

[27]:

'ResultMeasureValue: missing (NaN) result; ResultMeasure/MeasureUnitCode: MISSING UNITS, m assumed'

[28]:

# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]

[28]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Units	Secchi
12509	NARS_WQX-NLA12_MA-102	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	m	NaN
35869	EPA_OCMA_R1-SS-2	dark	m	ResultMeasureValue: "dark" result cannot be used	m	NaN
35993	NARS_WQX-NCCA10-1070	-9	None	ResultMeasure/MeasureUnitCode: 'None' UNDEFINE...	m	-9.0 meter
36143	NARS_WQX-NCCA10-1029	-9	None	ResultMeasure/MeasureUnitCode: 'None' UNDEFINE...	m	-9.0 meter
37270	NARS_WQX-NCCA10-1001	-9	None	ResultMeasure/MeasureUnitCode: 'None' UNDEFINE...	m	-9.0 meter
...	...	...	...	...	...	...
368189	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
368293	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
368649	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
369416	11NPSWRD_WQX-CACO_GREAT_W	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN
456553	11NPSWRD_WQX-CACO_SLOUGH	NaN	m	ResultMeasureValue: missing (NaN) result	m	NaN

272 rows × 6 columns

If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added

[29]:

# Aggregate secchi data by station
visualize.station_summary(sechi_results, 'Secchi')

[29]:

	MonitoringLocationIdentifier	cnt	mean
0	11113300-ANGSDND	42	3.796369
1	11113300-BEADERD	42	3.644048
2	11113300-CANWIND	22	6.144000
3	11113300-CAPSALD	24	2.456262
4	11113300-COBWINND	21	3.768810
...	...	...	...
691	WWMD_VA-SH1	2	1.500000
692	WWMD_VA-SH2	2	1.800000
693	WWMD_VA-SR6A	1	0.500000
694	WWMD_VA-WF2	1	0.600000
695	WWMD_VA-WR5	1	2.000000

696 rows × 3 columns

[30]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
  self.bins = quantile(y, k=k)

[30]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_42_2.png

[31]:

# Map average results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[31]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_43_1.png

Temperature

The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])

[32]:

#'Temperature, water'
# errors=‘ignore’, invalid dimension conversions will return the NaN.
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count    125644.000000
mean         16.023829
std           7.328661
min          -6.000000
25%          11.000000
50%          16.800000
75%          21.660000
max         910.000000
dtype: float64
Unusable results: 213
Usable results with inferred units: 217
Results outside threshold (0.0 to 59.99579510818923): 177

../_images/notebooks_Harmonize_CapeCod_Detailed_46_2.png

[33]:

# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results

[33]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
1	OARS-CND-161	22.18	deg C	NaN	22.18 degree_Celsius	degC
3	CRWA-ROB	23.01	deg C	NaN	23.01 degree_Celsius	degC
5	WTGHA-M43	17.39	deg C	NaN	17.39 degree_Celsius	degC
8	MASSDEP-W2386	20.9	deg C	NaN	20.9 degree_Celsius	degC
12	WTGHA-S35	22.54	deg C	NaN	22.54 degree_Celsius	degC
...	...	...	...	...	...	...
457234	USGS-413831070304703	11.3	deg C	NaN	11.3 degree_Celsius	degC
457236	USGS-01100627	7.0	deg C	NaN	7.0 degree_Celsius	degC
457240	USGS-01096548	2.9	deg C	NaN	2.9 degree_Celsius	degC
457244	USGS-01098530	1.5	deg C	NaN	1.5 degree_Celsius	degC
457248	USGS-01097050	3.9	deg C	NaN	3.9 degree_Celsius	degC

125857 rows × 6 columns

[34]:

# Examine deg F
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'] == 'deg F']

[34]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
19652	NALMS-F865245	81	deg F	NaN	27.222222222222285 degree_Celsius	degF
28290	11113300-HOODERD	59.2	deg F	NaN	15.111111111111143 degree_Celsius	degF
28661	11113300-HOODERD	59.1	deg F	NaN	15.0555555555556 degree_Celsius	degF
28991	11113300-HOODERD	59.9	deg F	NaN	15.500000000000057 degree_Celsius	degF
29537	11113300-HOODERD	56	deg F	NaN	13.333333333333371 degree_Celsius	degF
...	...	...	...	...	...	...
452878	11NPSWRD_WQX-SAIR_MEQ76_NC6	70.0	deg F	NaN	21.111111111111143 degree_Celsius	degF
452882	11NPSWRD_WQX-SAMA_MEQ76_NC13	72.0	deg F	NaN	22.222222222222285 degree_Celsius	degF
452891	11NPSWRD_WQX-SAMA_MEQ76_NC18	71.0	deg F	NaN	21.666666666666686 degree_Celsius	degF
452897	11NPSWRD_WQX-SAIR_MEQ76_NC3	70.0	deg F	NaN	21.111111111111143 degree_Celsius	degF
452902	11NPSWRD_WQX-SAMA_MEQ76_SB05	60.0	deg F	NaN	15.5555555555556 degree_Celsius	degF

695 rows × 6 columns

In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius

[35]:

# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]

[35]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
127677	AQUINNAH-MEN TXCO	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
127678	AQUINNAH-MEN TXCO	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
127679	AQUINNAH-MEN TXCO	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
127680	AQUINNAH-MEN TXCO	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
127681	AQUINNAH-MEN TXCO	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
...	...	...	...	...	...	...
129019	AQUINNAH-MEN PND Q	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
129020	AQUINNAH-MEN PND Q	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
129022	AQUINNAH-MEN PND Q	0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
143966	AQUINNAH-MEN PND HC	0.0	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	0.0 degree_Celsius	degC
421279	USGS-414654070002901	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN	degC

218 rows × 6 columns

We can see where the units were missing, the results were assumed to be in degree_Celsius already

[36]:

# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]

[36]:

'ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'

[37]:

# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]

[37]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
343	MERRIMACK_RIVER_WATERSHED_WQX-Hayden-Schofield	8	count	NaN	NaN	count
3512	MERRIMACK_RIVER_WATERSHED_WQX-Manchester	8	count	NaN	NaN	count
4018	MERRIMACK_RIVER_WATERSHED_WQX-Nina-Scarito	8	count	NaN	NaN	count
5102	MERRIMACK_RIVER_WATERSHED_WQX-Misserville	8	count	NaN	NaN	count
5498	MERRIMACK_RIVER_WATERSHED_WQX-Ferrous	8	count	NaN	NaN	count
...	...	...	...	...	...	...
285725	11NPSWRD_WQX-SAIR_SWC_SR3	NaN	deg C	ResultMeasureValue: missing (NaN) result	NaN	degC
301036	11NPSWRD_WQX-SAIR_SWC_SR6	NaN	deg C	ResultMeasureValue: missing (NaN) result	NaN	degC
302367	11NPSWRD_WQX-SAIR_SWC_SR7	NaN	deg C	ResultMeasureValue: missing (NaN) result	NaN	degC
306809	11NPSWRD_WQX-SAIR_SWC_SRT9	NaN	deg C	ResultMeasureValue: missing (NaN) result	NaN	degC
421279	USGS-414654070002901	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN	degC

213 rows × 6 columns

[38]:

# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')

[38]:

	MonitoringLocationIdentifier	cnt	mean
0	11113300-00-SPB	21	10.909524
1	11113300-00F-KLY	3	20.233333
2	11113300-01-BAK	10	20.600000
3	11113300-01-BVR	1	21.000000
4	11113300-01-CTP	39	24.438462
...	...	...	...
5963	WWMD_VA-WWE1149PI	1	18.800000
5964	WWMD_VA-WWE1150PI	1	21.300000
5965	WWMD_VA-WWE1151PI	2	20.700000
5966	WWMD_VA-WWE1152PI	1	20.000000
5967	WWMD_VA-WWE1153PI	1	20.800000

5968 rows × 3 columns

[39]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

[39]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_55_1.png

[40]:

# Map average results at each station
gdf_avg = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[40]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_56_1.png

Dissolved Oxygen (DO)

[41]:

# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')

Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.

[42]:

# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res

[42]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	DO
14	IRWA-HB	4.6	mg/L	NaN	4.6 milligram / liter
17	MYRWA-MEB001	9.4	mg/L	NaN	9.4 milligram / liter
19	NARS_WQX-MARO-1020	10.3	mg/L	NaN	10.3 milligram / liter
22	WTGHA-M46	24.7	mg/L	NaN	24.7 milligram / liter
23	MASSDEP-W2412	8.7	mg/L	NaN	8.7 milligram / liter
...	...	...	...	...	...
453729	11NPSWRD_WQX-CACO_GREAT_W	6.1	mg/L	NaN	6.1 milligram / liter
453736	11NPSWRD_WQX-SAMA_EPG_SHC	9.1	mg/L	NaN	9.1 milligram / liter
453739	11NPSWRD_WQX-SAMA_EPG_SHC	14.2	mg/L	NaN	14.2 milligram / liter
453742	11NPSWRD_WQX-SAMA_EPG_SHD	9.2	mg/L	NaN	9.2 milligram / liter
453744	11NPSWRD_WQX-SAMA_EPG_SHD	8.5	mg/L	NaN	8.5 milligram / liter

77324 rows × 5 columns

[43]:

do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']

[43]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	DO
14	IRWA-HB	4.6	mg/L	NaN	4.6 milligram / liter
17	MYRWA-MEB001	9.4	mg/L	NaN	9.4 milligram / liter
19	NARS_WQX-MARO-1020	10.3	mg/L	NaN	10.3 milligram / liter
22	WTGHA-M46	24.7	mg/L	NaN	24.7 milligram / liter
23	MASSDEP-W2412	8.7	mg/L	NaN	8.7 milligram / liter
...	...	...	...	...	...
453729	11NPSWRD_WQX-CACO_GREAT_W	6.1	mg/L	NaN	6.1 milligram / liter
453736	11NPSWRD_WQX-SAMA_EPG_SHC	9.1	mg/L	NaN	9.1 milligram / liter
453739	11NPSWRD_WQX-SAMA_EPG_SHC	14.2	mg/L	NaN	14.2 milligram / liter
453742	11NPSWRD_WQX-SAMA_EPG_SHD	9.2	mg/L	NaN	9.2 milligram / liter
453744	11NPSWRD_WQX-SAMA_EPG_SHD	8.5	mg/L	NaN	8.5 milligram / liter

76027 rows × 5 columns

Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())

[44]:

# Aggregate data by station
visualize.station_summary(do_res, 'DO')

[44]:

	MonitoringLocationIdentifier	cnt	mean
0	11113300-00F-KLY	3	4.040000
1	11113300-01-BAK	10	5.697000
2	11113300-01-BVR	1	6.700000
3	11113300-01-CTP	37	6.945135
4	11113300-01-GOL	1	4.400000
...	...	...	...
2358	WWMD_VA-SR5	1	10.810000
2359	WWMD_VA-WF2	2	6.800000
2360	WWMD_VA-WI1	1	9.510000
2361	WWMD_VA-WR2X	2	6.000000
2362	WWMD_VA-WR5	1	8.355000

2363 rows × 3 columns

[45]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

[45]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_64_1.png

[46]:

# Map average results at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[46]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_65_1.png

pH

[47]:

# pH, this time looking at a report
df = harmonize.harmonize(df, "pH", errors="ignore", report=True)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'mV' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count    100709.000000
mean          6.594211
std           2.583377
min          -3.124705
25%           5.800000
50%           6.700000
75%           7.500000
max         715.000000
dtype: float64
Unusable results: 2210
Usable results with inferred units: 215
Results outside threshold (0.0 to 22.09447506883673): 12

../_images/notebooks_Harmonize_CapeCod_Detailed_67_2.png

Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.

[48]:

df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]

[48]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	pH
6	7.3	None	NaN	7.3 dimensionless
9	8.16	None	NaN	8.16 dimensionless
26	7.99	None	NaN	7.99 dimensionless
43	6.61	None	NaN	6.61 dimensionless
45	8.23	None	NaN	8.23 dimensionless
...	...	...	...	...
457235	5.8	std units	NaN	5.8 dimensionless
457237	7.3	std units	NaN	7.3 dimensionless
457241	7.4	std units	NaN	7.4 dimensionless
457245	7.5	std units	NaN	7.5 dimensionless
457249	7.4	std units	NaN	7.4 dimensionless

102919 rows × 4 columns

‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless

Salinity

[49]:

# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/basis.py:154: UserWarning: Mismatched ResultTemperatureBasisText: updated from 25 deg C to @25C (units)
  warn(f"Mismatched {flag}", UserWarning)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'deg C' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count    36923.000000
mean        20.638884
std         25.204451
min          0.000000
25%          0.900000
50%         29.700000
75%         31.380000
max       4003.482834
dtype: float64
Unusable results: 218
Usable results with inferred units: 1
Results outside threshold (0.0 to 171.8655897766306): 2

../_images/notebooks_Harmonize_CapeCod_Detailed_72_2.png

[50]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]

[50]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity
10	30.33	ppt	NaN	30.33 Practical_Salinity_Units
51	30.48	ppt	NaN	30.48 Practical_Salinity_Units
56	33.3	ppth	NaN	33.3 Practical_Salinity_Units
72	30.91	ppt	NaN	30.91 Practical_Salinity_Units
76	0.21	ppth	NaN	0.21 Practical_Salinity_Units
...	...	...	...	...
456043	9.9	PSU	NaN	9.9 Practical_Salinity_Units
456648	4.6	PSU	NaN	4.6 Practical_Salinity_Units
456748	21.0	PSU	NaN	21.0 Practical_Salinity_Units
456761	18.7	PSU	NaN	18.7 Practical_Salinity_Units
456872	10.1	PSU	NaN	10.1 Practical_Salinity_Units

37141 rows × 4 columns

Nitrogen

[51]:

# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'cm3/g' UNDEFINED UNIT for Nitrogen
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pandas/core/construction.py:616: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.array(data, copy=copy)

-Usable results-
count    2409.000000
mean        6.472302
std       106.930304
min         0.000800
25%         0.074000
50%         0.242000
75%         1.000000
max      2800.000000
dtype: float64
Unusable results: 243
Usable results with inferred units: 0
Results outside threshold (0.0 to 648.0541246058881): 6

../_images/notebooks_Harmonize_CapeCod_Detailed_75_2.png

[52]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]

[52]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Nitrogen
840	0.926976	mg/L	NaN	0.926976 milligram / liter
912	1.571196	mg/L	NaN	1.571196 milligram / liter
919	0.816144	mg/L	NaN	0.816144 milligram / liter
1216	0.770448	mg/L	NaN	0.770448 milligram / liter
1253	0.848832	mg/L	NaN	0.848832 milligram / liter
...	...	...	...	...
457068	0.119	mg/l	NaN	0.119 milligram / liter
457126	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
457169	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
457175	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
457212	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN

2652 rows × 4 columns

Conductivity

[53]:

# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count     8694.000000
mean     23607.687800
std      17412.071345
min          0.000000
25%        612.000000
50%      27773.988852
75%      39935.000000
max      57700.000000
dtype: float64
Unusable results: 81
Usable results with inferred units: 0
Results outside threshold (0.0 to 128080.11587319421): 0

../_images/notebooks_Harmonize_CapeCod_Detailed_78_2.png

[54]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]

[54]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Conductivity
2	590	uS/cm	NaN	590.0 microsiemens / centimeter
7	43.5	mS/cm	NaN	43500.0 microsiemens / centimeter
49	349	uS/cm	NaN	349.0 microsiemens / centimeter
52	49.7	mS/cm	NaN	49700.0 microsiemens / centimeter
58	8	count	NaN	NaN
...	...	...	...	...
381877	20.579	mS/cm	NaN	20579.0 microsiemens / centimeter
381912	38.432	mS/cm	NaN	38432.0 microsiemens / centimeter
381919	37.791	mS/cm	NaN	37791.0 microsiemens / centimeter
381924	19.003	mS/cm	NaN	19003.0 microsiemens / centimeter
381943	40.652	mS/cm	NaN	40652.0 microsiemens / centimeter

8775 rows × 4 columns

Chlorophyll a

[55]:

# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/cm2' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ppb' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/m3' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)

-Usable results-
count    4171.000000
mean        0.581802
std         3.280718
min        -0.002400
25%         0.001900
50%         0.005200
75%         0.019000
max        92.900000
dtype: float64
Unusable results: 202
Usable results with inferred units: 9
Results outside threshold (0.0 to 20.266111880769866): 29

../_images/notebooks_Harmonize_CapeCod_Detailed_81_2.png

[56]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]

[56]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Chlorophyll
13	4.0	mg/m3	NaN	0.004000000000000001 milligram / liter
275	5.50	ug/L	NaN	0.0055 milligram / liter
375	9.48	ug/L	NaN	0.00948 milligram / liter
683	3.0	mg/m3	NaN	0.003000000000000001 milligram / liter
971	NaN	mg/m3	ResultMeasureValue: missing (NaN) result	NaN
...	...	...	...	...
456952	0.5	ug/l	NaN	0.0005 milligram / liter
456959	2.5	ug/l	NaN	0.0025 milligram / liter
456972	0.7	ug/l	NaN	0.0007 milligram / liter
457005	1.1	ug/l	NaN	0.0011 milligram / liter
457060	0.7	ug/l	NaN	0.0007 milligram / liter

4373 rows × 4 columns

Organic Carbon

[57]:

# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True)

-Usable results-
count      6141.000000
mean       5287.284256
std       35592.329726
min      -90000.000000
25%           1.740000
50%           4.900000
75%           8.900000
max      530000.000000
dtype: float64
Unusable results: 79
Usable results with inferred units: 0
Results outside threshold (0.0 to 218841.26261451485): 40

../_images/notebooks_Harmonize_CapeCod_Detailed_84_1.png

[58]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]

[58]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Carbon
965	7.05	mg/L	NaN	7.05 milligram / liter
1097	7.57	mg/L	NaN	7.57 milligram / liter
1150	3.5	mg/L	NaN	3.5 milligram / liter
3027	3.8	mg/L	NaN	3.8 milligram / liter
3852	3.7	mg/L	NaN	3.7 milligram / liter
...	...	...	...	...
457056	7.01	mg/l	NaN	7.01 milligram / liter
457125	5.96	mg/l	NaN	5.96 milligram / liter
457168	4.1	mg/l	NaN	4.1 milligram / liter
457174	7.79	mg/l	NaN	7.79 milligram / liter
457211	5.79	mg/l	NaN	5.79 milligram / liter

6220 rows × 4 columns

Turbidity (NTU)

[59]:

# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/harmonize.py:149: UserWarning: Bad Turbidity unit: count
  warn(f"Bad Turbidity unit: {unit}")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count    29171.000000
mean        18.052962
std         78.056340
min       -999.000000
25%          1.400000
50%          2.890000
75%          6.500000
max       4100.000000
dtype: float64
Unusable results: 547
Usable results with inferred units: 275
Results outside threshold (0.0 to 486.3910018894372): 190

../_images/notebooks_Harmonize_CapeCod_Detailed_87_2.png

[60]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]

[60]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Turbidity
0	7.6	NTU	NaN	7.6 Nephelometric_Turbidity_Units
4	1	NTU	NaN	1.0 Nephelometric_Turbidity_Units
11	4.2	NTU	NaN	4.2 Nephelometric_Turbidity_Units
16	1.3	NTU	NaN	1.3 Nephelometric_Turbidity_Units
21	5.2	NTU	NaN	5.2 Nephelometric_Turbidity_Units
...	...	...	...	...
457229	1.6	NTRU	NaN	1.6 Nephelometric_Turbidity_Units
457239	5.8	NTRU	NaN	5.8 Nephelometric_Turbidity_Units
457243	2.1	NTRU	NaN	2.1 Nephelometric_Turbidity_Units
457247	2.9	NTRU	NaN	2.9 Nephelometric_Turbidity_Units
457251	2.7	NTRU	NaN	2.7 Nephelometric_Turbidity_Units

29718 rows × 4 columns

Sediment

[61]:

# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'g / H2O' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

[62]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]

[62]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Sediment
415893	0.012	g	NaN	NaN
415894	0.0037	g	NaN	NaN
415895	0.0048	g	NaN	NaN
415896	0.001	g	NaN	NaN
415897	0.0088	g	NaN	NaN
...	...	...	...	...
456318	0.0051	g	NaN	NaN
456392	0.0025	g	NaN	NaN
456464	0.002	g	NaN	NaN
456503	0.0023	g	NaN	NaN
456549	0.0014	g	NaN	NaN

4410 rows × 4 columns

Phosphorus

Note: must be merged w/ activities (package runs query by site if not already merged)

[63]:

# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pandas/core/construction.py:616: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.array(data, copy=copy)

2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"

Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.

[64]:

# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]

[64]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
65	0.049	mg/L	NaN	NaN
67	0.024	mg/L	NaN	NaN
71	0.05	mg/L	NaN	NaN
87	0.027	mg/L	NaN	NaN
106	0.059712	mg/L	NaN	NaN
...	...	...	...	...
456525	0.006	mg/l as P	NaN	0.006 milligram / liter
456537	0.016	mg/l as P	NaN	NaN
456538	0.008	mg/l as P	NaN	0.008 milligram / liter
456545	0.047	mg/l as P	NaN	NaN
456546	0.028	mg/l as P	NaN	0.028 milligram / liter

15994 rows × 4 columns

[65]:

# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]

[65]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
65	0.049	mg/L	NaN	NaN
67	0.024	mg/L	NaN	NaN
71	0.05	mg/L	NaN	NaN
87	0.027	mg/L	NaN	NaN
106	0.059712	mg/L	NaN	NaN
...	...	...	...	...
456499	0.036	mg/l as P	NaN	NaN
456510	0.021	mg/l as P	NaN	NaN
456524	0.011	mg/l as P	NaN	NaN
456537	0.016	mg/l as P	NaN	NaN
456545	0.047	mg/l as P	NaN	NaN

10721 rows × 4 columns

[66]:

# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]

[66]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
46491	0.023	ppm	NaN	0.023000000000000003 milligram / liter
50426	0.035	ppm	NaN	0.03500000000000001 milligram / liter
52416	0.017	ppm	NaN	0.017000000000000005 milligram / liter
52543	0.015	ppm	NaN	0.015000000000000003 milligram / liter
53907	0.015	ppm	NaN	0.015000000000000003 milligram / liter
...	...	...	...	...
456500	0.015	mg/l as P	NaN	0.015 milligram / liter
456511	0.005	mg/l as P	NaN	0.005 milligram / liter
456525	0.006	mg/l as P	NaN	0.006 milligram / liter
456538	0.008	mg/l as P	NaN	0.008 milligram / liter
456546	0.028	mg/l as P	NaN	0.028 milligram / liter

4857 rows × 4 columns

[67]:

# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]

[67]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
35395	0.03525375	mg/L	NaN	NaN
35464	0.107028125	mg/L	NaN	NaN
35973	0.0697675	mg/L	NaN	NaN
36237	0.04356	mg/L	NaN	NaN
36366	0.03654875	mg/L	NaN	NaN
...	...	...	...	...
446652	530.0	mg/kg	NaN	NaN
447989	280.0	mg/kg	NaN	NaN
448763	270.0	mg/kg	NaN	NaN
449009	260.0	mg/kg	NaN	NaN
449278	1300.0	mg/kg	NaN	NaN

416 rows × 4 columns

Bacteria

Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.

Fecal Coliform

[68]:

# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count      2462.000000
mean        910.512591
std        6103.365912
min           0.000000
25%          24.000000
50%          93.000000
75%         400.000000
max      250000.000000
dtype: float64
Unusable results: 5762
Usable results with inferred units: 1
Results outside threshold (0.0 to 37530.70806108734): 7

../_images/notebooks_Harmonize_CapeCod_Detailed_103_2.png

[69]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]

[69]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Fecal_Coliform
9833	40	#/100mL	NaN	NaN
9885	9	#/100mL	NaN	NaN
9900	20	#/100mL	NaN	NaN
10109	200	#/100mL	NaN	NaN
10305	NO DATA	#/100mL	ResultMeasureValue: "NO DATA" result cannot be...	NaN
...	...	...	...	...
452883	2400.0	cfu/100mL	NaN	NaN
452885	230.0	cfu/100mL	NaN	NaN
452893	30.0	cfu/100mL	NaN	NaN
452901	NaN	cfu/100mL	ResultMeasureValue: missing (NaN) result	NaN
453798	1.0	cfu/100ml	NaN	1.0 Colony_Forming_Units / milliliter

8224 rows × 4 columns

Excherichia Coli

[70]:

# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'CFUcol/100mL' UNDEFINED UNIT for E_coli
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: '%' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count     5189.000000
mean       245.882382
std       1752.315000
min          0.000000
25%          0.500000
50%         30.000000
75%        128.000000
max      72000.000000
dtype: float64
Unusable results: 22197
Usable results with inferred units: 0
Results outside threshold (0.0 to 10759.772381268724): 14

../_images/notebooks_Harmonize_CapeCod_Detailed_106_2.png

[71]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]

[71]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	E_coli
15	96	MPN/100mL	NaN	NaN
18	58	#/100mL	NaN	NaN
24	NaN	MPN/100mL	ResultMeasureValue: missing (NaN) result	NaN
37	24200	MPN/100mL	NaN	NaN
38	7270	MPN/100mL	NaN	NaN
...	...	...	...	...
457225	7.0	MPN/100 ml	NaN	7.0 Colony_Forming_Units / milliliter
457238	870.0	MPN/100 ml	NaN	870.0 Colony_Forming_Units / milliliter
457242	44.0	MPN/100 ml	NaN	44.0 Colony_Forming_Units / milliliter
457246	14.0	MPN/100 ml	NaN	14.0 Colony_Forming_Units / milliliter
457250	390.0	MPN/100 ml	NaN	390.0 Colony_Forming_Units / milliliter

27386 rows × 4 columns

Combining Salinity and Conductivity

Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).

Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.

Explore Salinity results:

[72]:

from harmonize_wq import convert

[73]:

# First note initial Salinity info
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))

Range: 0.0 to 4003.4828342857154
Results: 36923
Mean: 20.63888369988916 PSU

[74]:

# Identify extreme outliers
[x for x in lst if x >3200]

[74]:

[4003.4828342857154]

Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration

[75]:

# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']

[76]:

# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]

[76]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity
28273	804	ppth	NaN	804.0 Practical_Salinity_Units
143522	71.49	ppth	NaN	71.49 Practical_Salinity_Units
144531	70.62	ppth	NaN	70.62 Practical_Salinity_Units
213318	77.666667	ppt	NaN	77.6666666666667 Practical_Salinity_Units
424274	4980.0	mg/mL @25C	ResultTemperatureBasisText: updated from 25 de...	4003.4828342857154 Practical_Salinity_Units

Detection limits may help understand what caused low values and what results might need to be dropped or updated

[77]:

df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
       'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode']

[78]:

# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]

[78]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity	ResultDetectionConditionText	DetectionQuantitationLimitTypeName	DetectionQuantitationLimitMeasure/MeasureValue	DetectionQuantitationLimitMeasure/MeasureUnitCode
13197	0	ppt	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
13781	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
13968	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
20944	0	ppm	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
22582	0	ppm	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
363590	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
363646	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
363692	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
363738	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
363750	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN

649 rows × 8 columns

Explore conductivity results:

[79]:

# Create series for Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series

[79]:

2           590.0 microsiemens / centimeter
7         43500.0 microsiemens / centimeter
49          349.0 microsiemens / centimeter
52        49700.0 microsiemens / centimeter
83          443.0 microsiemens / centimeter
                        ...
381877    20579.0 microsiemens / centimeter
381912    38432.0 microsiemens / centimeter
381919    37791.0 microsiemens / centimeter
381924    19003.0 microsiemens / centimeter
381943    40652.0 microsiemens / centimeter
Name: Conductivity, Length: 8694, dtype: object

Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters

[80]:

# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]

[80]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity	Conductivity
218963	57700.0	uS/cm	NaN	NaN	57700.0 microsiemens / centimeter
270752	52.418	mS/cm	NaN	NaN	52418.0 microsiemens / centimeter
280935	51.618	mS/cm	NaN	NaN	51618.0 microsiemens / centimeter
268321	51.325	mS/cm	NaN	NaN	51325.0 microsiemens / centimeter
51000	51.05	mS/cm	NaN	NaN	51050.0 microsiemens / centimeter
...	...	...	...	...	...
6304	0.02	uS/cm	NaN	NaN	0.02 microsiemens / centimeter
105239	0	mS/cm	NaN	NaN	0.0 microsiemens / centimeter
8095	0	uS/cm	NaN	NaN	0.0 microsiemens / centimeter
105324	0	mS/cm	NaN	NaN	0.0 microsiemens / centimeter
3825	0	uS/cm	NaN	NaN	0.0 microsiemens / centimeter

8694 rows × 5 columns

[81]:

# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str)  # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']

[81]:

218963    38.468 dimensionless
270752    34.521 dimensionless
280935    33.929 dimensionless
268321    33.713 dimensionless
51000      33.51 dimensionless
                  ...
6304       0.012 dimensionless
105239     0.012 dimensionless
8095       0.012 dimensionless
105324     0.012 dimensionless
3825       0.012 dimensionless
Name: Salinity, Length: 8694, dtype: object

Datetime

datetime() formats time using dataretrieval and ActivityStart

[82]:

# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]

[82]:

	ActivityStartDate	ActivityStartTime/Time	ActivityStartTime/TimeZoneCode
218963	2021-06-21	NaN	NaN
270752	2022-08-09	12:58:00	EST
280935	2022-08-09	13:25:00	EST
268321	2022-07-25	11:40:00	EST
51000	2016-08-15	NaN	NaN
...	...	...	...
457247	2024-01-08	12:00:00	EST
457248	2024-02-12	08:45:00	EST
457249	2024-02-12	08:45:00	EST
457250	2024-02-12	08:45:00	EST
457251	2024-02-12	08:45:00	EST

457252 rows × 3 columns

[83]:

# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 167291 incomplete dates found, consider setting datetime_index to False.
  warnings.warn(

[83]:

	ActivityStartDate	Activity_datetime
218963	2021-06-21	NaT
270752	2022-08-09	2022-08-09 17:58:00+00:00
280935	2022-08-09	2022-08-09 18:25:00+00:00
268321	2022-07-25	2022-07-25 16:40:00+00:00
51000	2016-08-15	NaT
...	...	...
457247	2024-01-08	2024-01-08 17:00:00+00:00
457248	2024-02-12	2024-02-12 13:45:00+00:00
457249	2024-02-12	2024-02-12 13:45:00+00:00
457250	2024-02-12	2024-02-12 13:45:00+00:00
457251	2024-02-12	2024-02-12 13:45:00+00:00

457252 rows × 2 columns

Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.

Depth

Note: Data are often lacking sample depth metadata

[84]:

# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()

[84]:

110874    0.15
112336    0.15
73386     0.15
78101     0.15
111338    0.15
          ...
320473    0.15
320504    0.15
320515    0.15
320573    0.15
320587    0.15
Name: ResultDepthHeightMeasure/MeasureValue, Length: 2460, dtype: float64

Characteristic to Column (long to wide format)

[85]:

# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2

[85]:

	OrganizationIdentifier	OrganizationFormalName	ActivityIdentifier	ActivityStartDate	ActivityStartTime/Time	ActivityStartTime/TimeZoneCode	MonitoringLocationIdentifier	ResultIdentifier	DataLoggerLine	ResultDetectionConditionText	...	QA_Other_Phosphorus	QA_E_coli	QA_Carbon	QA_Temperature	QA_Chlorophyll	QA_Secchi	QA_Sediment	QA_Turbidity	QA_Nitrogen	QA_Fecal_Coliform
218963	NARS_WQX	EPA National Aquatic Resources Survey (NARS)	NARS_WQX-185450_2021	2021-06-21	NaN	NaN	NARS_WQX-NWC_RI-10033	STORET-1040684142	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
270752	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P3:20220809125800:FM:B	2022-08-09	12:58:00	-0500	WTGHA-P3	STORET-1041330082	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
280935	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P4:20220809132500:FM:B	2022-08-09	13:25:00	-0500	WTGHA-P4	STORET-1041330098	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
268321	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P3:20220725114000:FM:B	2022-07-25	11:40:00	-0500	WTGHA-P3	STORET-1041329903	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
51000	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P3:20160815:FM:B	2016-08-15	NaN	NaN	WTGHA-P3	STORET-1041309861	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
457247	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400495	2024-01-08	12:00:00	-0500	USGS-01098530	NWIS-126836828	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
457248	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	-0500	USGS-01097050	NWIS-126989046	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
457249	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	-0500	USGS-01097050	NWIS-126989056	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
457250	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	-0500	USGS-01097050	NWIS-126989063	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
457251	USGS-MA	USGS Massachusetts Water Science Center	nwisma.01.02400639	2024-02-12	08:45:00	-0500	USGS-01097050	NWIS-126989064	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

419173 rows × 118 columns

[86]:

# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))

15 new columns

[87]:

# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))

38079 fewer rows

[88]:

#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]

[88]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	Carbon	QA_Carbon

Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)

[89]:

# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)

[90]:

# Columns still in main table
main_df.columns

[90]:

Index(['OrganizationIdentifier', 'OrganizationFormalName',
       'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
       'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
       'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
       'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
       'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
       'E_coli', 'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode',
       'Activity_datetime', 'Depth', 'QA_Salinity', 'QA_pH', 'QA_Conductivity',
       'QA_DO', 'QA_TP_Phosphorus', 'QA_TDP_Phosphorus', 'QA_Other_Phosphorus',
       'QA_E_coli', 'QA_Carbon', 'QA_Temperature', 'QA_Chlorophyll',
       'QA_Secchi', 'QA_Sediment', 'QA_Turbidity', 'QA_Nitrogen',
       'QA_Fecal_Coliform'],
      dtype='object')

[91]:

# look at main table results (first 5)
main_df.head()

[91]:

	OrganizationIdentifier	OrganizationFormalName	ActivityIdentifier	MonitoringLocationIdentifier	ProviderName	Secchi	Temperature	DO	pH	Salinity	...	QA_Other_Phosphorus	QA_E_coli	QA_Carbon	QA_Temperature	QA_Chlorophyll	QA_Secchi	QA_Sediment	QA_Turbidity	QA_Nitrogen	QA_Fecal_Coliform
218963	NARS_WQX	EPA National Aquatic Resources Survey (NARS)	NARS_WQX-185450_2021	NARS_WQX-NWC_RI-10033	STORET	NaN	NaN	NaN	NaN	38.468 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
270752	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P3:20220809125800:FM:B	WTGHA-P3	STORET	NaN	NaN	NaN	NaN	34.521 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
280935	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P4:20220809132500:FM:B	WTGHA-P4	STORET	NaN	NaN	NaN	NaN	33.929 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
268321	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P3:20220725114000:FM:B	WTGHA-P3	STORET	NaN	NaN	NaN	NaN	33.713 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
51000	WTGHA	Wompanoag Tribe of Gay Head Aquinnah (Tribal)	WTGHA-P3:20160815:FM:B	WTGHA-P3	STORET	NaN	NaN	NaN	NaN	33.51 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 45 columns

[92]:

# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]

[92]:

['Sediment',
 'DetectionQuantitationLimitTypeName',
 'DetectionQuantitationLimitMeasure/MeasureValue',
 'DetectionQuantitationLimitMeasure/MeasureUnitCode',
 'QA_Conductivity',
 'QA_DO',
 'QA_TDP_Phosphorus',
 'QA_Carbon',
 'QA_Sediment']

[93]:

# Map average results at each station
gdf_avg = visualize.map_measure(main_df, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[93]:

<Axes: >

../_images/notebooks_Harmonize_CapeCod_Detailed_144_1.png