Pensacola Bay FL - Detailed step-by-step

Standardize, clean and wrangle Water Quality Portal data in Pensacola and Perdido Bays into more analytic-ready formats using the harmonize_wq package

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:

Identify differences in data units (including speciation and basis)
Identify differences in sampling or analytic methods
Resolve data errors using transparent assumptions
Reduce data to the columns that are most commonly needed
Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

Detailed step-by-step workflow

This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Pensacola and Perdido Bays

Install and import the required libraries

[1]:

import sys
#!python -m pip uninstall harmonize-wq --yes
#!python -m pip install harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8

[2]:

import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
  warnings.warn('GDAL not installed. Some functions will not work.')

Download location data using dataretrieval

[3]:

# Read geometry for Area of Interest from geojson file url and plot
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
aoi_gdf = wrangle.as_gdf(aoi_url).to_crs(epsg=4326)  # already standard 4326
aoi_gdf.plot()

[3]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_9_1.png

[4]:

# Note there are actually two polygons (one for each Bay)
aoi_gdf
# Spatial query parameters can be updated to run just one
bBox = wrangle.get_bounding_box(aoi_gdf)
# For only one bay, e.g., first is Pensacola Bay:
#bBox = wrangle.get_bounding_box(aoi_gdf, 0)

[5]:

# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
                                'Temperature, water',
                                'Depth, Secchi disk depth',
                                'Dissolved oxygen (DO)',
                                'Salinity',
                                'pH',
                                'Nitrogen',
                                'Conductivity',
                                'Organic carbon',
                                'Chlorophyll a',
                                'Turbidity',
                                'Sediment',
                                'Fecal Coliform',
                                'Escherichia coli']}
query['bBox'] = bBox

[6]:

# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)

[7]:

# Rows and columns for results
stations.shape

[7]:

(2863, 37)

[8]:

# First 5 rows
stations.head()

[8]:

	OrganizationIdentifier	OrganizationFormalName	MonitoringLocationIdentifier	MonitoringLocationName	MonitoringLocationTypeName	MonitoringLocationDescriptionText	HUCEightDigitCode	DrainageAreaMeasure/MeasureValue	DrainageAreaMeasure/MeasureUnitCode	ContributingDrainageAreaMeasure/MeasureValue	...	AquiferName	LocalAqfrName	FormationTypeText	AquiferTypeName	ConstructionDateText	WellDepthMeasure/MeasureValue	WellDepthMeasure/MeasureUnitCode	WellHoleDepthMeasure/MeasureValue	WellHoleDepthMeasure/MeasureUnitCode	ProviderName
0	USGS-AL	USGS Alabama Water Science Center	USGS-02376115	ELEVENMILE CREEK NR WEST PENSACOLA, FL	Stream	NaN	3140107.0	27.8	sq mi	27.8	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
1	USGS-AL	USGS Alabama Water Science Center	USGS-02377570	STYX RIVER NEAR ELSANOR, AL.	Stream	NaN	3140106.0	192.0	sq mi	192.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
2	USGS-AL	USGS Alabama Water Science Center	USGS-02377920	BLACKWATER RIVER AT US HWY 90 NR ROBERTSDALE, AL.	Stream	NaN	3140106.0	23.1	sq mi	23.1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
3	USGS-AL	USGS Alabama Water Science Center	USGS-02377960	BLACKWATER RIVER AT CO RD 87 NEAR ELSANOR, AL.	Stream	NaN	3140106.0	56.6	sq mi	56.6	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
4	USGS-AL	USGS Alabama Water Science Center	USGS-02377975	BLACKWATER RIVER ABOVE SEMINOLE AL	Stream	NaN	3140106.0	40.2	sq mi	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS

5 rows × 37 columns

[9]:

# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]

[9]:

HorizontalCoordinateReferenceSystemDatumName        NAD83
LatitudeMeasure                                 30.498252
LongitudeMeasure                               -87.335809
Name: 0, dtype: object

[10]:

# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, out_EPSG=4326, intermediate_columns=True)

[11]:

location.harmonize_locations?

[12]:

# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape

[12]:

(2863, 42)

[13]:

# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]

[13]:

geom_orig         (-87.3358086, 30.49825159)
EPSG                                  4269.0
QA_flag                                  NaN
geom         POINT (-87.3358086 30.49825159)
geometry     POINT (-87.3358086 30.49825159)
Name: 0, dtype: object

[14]:

# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype

[14]:

<geopandas.array.GeometryDtype at 0x7f9be0376910>

[15]:

# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])

[15]:

{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
 'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed'}

[16]:

# Map it
stations_gdf.plot()

[16]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_22_1.png

[17]:

# Clip to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)

[18]:

# Map it
stations_clipped.plot()

[18]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_24_1.png

[19]:

# How many stations now?
len(stations_clipped)

[19]:

[20]:

# To save the results to a shapefile
#import os
#path = ''  #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path, 'PPBEP_stations.shp'))

Retrieve Characteristic Data

[21]:

# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (10,13,15,17,19,20,21,22,23,28,31,33,34,36,58,60,61,64,65,69,70,71,72,73) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(StringIO(response.text), delimiter=',')

[22]:

df = res_narrow
df

[22]:

	OrganizationIdentifier	OrganizationFormalName	ActivityIdentifier	ActivityStartDate	ActivityStartTime/Time	ActivityStartTime/TimeZoneCode	MonitoringLocationIdentifier	ResultIdentifier	DataLoggerLine	ResultDetectionConditionText	...	AnalysisEndTime/TimeZoneCode	ResultLaboratoryCommentCode	ResultLaboratoryCommentText	ResultDetectionQuantitationLimitUrl	LaboratoryAccreditationIndicator	LaboratoryAccreditationAuthorityName	TaxonomistAccreditationIndicator	TaxonomistAccreditationAuthorityName	LabSamplePreparationUrl	ProviderName
0	21FLSEAS_WQX	Florida Department of Environmental Protection	21FLSEAS_WQX-027950424132	2013-04-24	09:01:00	EST	21FLSEAS_WQX-02SEAS795	STORET-310551339	NaN	NaN	...	NaN	NaN	NaN	https://www.waterqualitydata.us/data/providers...	NaN	NaN	NaN	NaN	NaN	STORET
1	21FLSEAS_WQX	Florida Department of Environmental Protection	21FLSEAS_WQX-027400613134	2013-06-13	10:01:00	EST	21FLSEAS_WQX-02SEAS740	STORET-310489836	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
2	21FLPNS_WQX	FL Dept. of Environmental Protection, Northwes...	21FLPNS_WQX-1536988F1	2013-09-17	11:01:00	EST	21FLPNS_WQX-33030019	STORET-308146602	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
3	21FLPNS_WQX	FL Dept. of Environmental Protection, Northwes...	21FLPNS_WQX-1520810L	2013-07-23	10:01:00	EST	21FLPNS_WQX-33020146	STORET-308157908	NaN	NaN	...	NaN	NaN	NaN	https://www.waterqualitydata.us/data/providers...	NaN	NaN	NaN	NaN	NaN	STORET
4	21FLCBA_WQX	CHOCTAWHATCHEE BASIN ALLIANCE	21FLCBA_WQX-BAS219848-162813	2013-09-23	17:15:00	CST	21FLCBA_WQX-BAS02	STORET-760593202	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	STORET
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
418454	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95900924	1959-02-04	NaN	NaN	USGS-303745086442101	NWIS-98928104	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
418455	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95800572	1958-01-14	09:20:00	CST	USGS-02376108	NWIS-6891392	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
418456	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95800572	1958-01-14	09:20:00	CST	USGS-02376108	NWIS-6891396	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
418457	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95900926	1959-02-05	NaN	NaN	USGS-303820086241802	NWIS-6916678	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS
418458	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95900852	1959-02-02	NaN	NaN	USGS-302330086482001	NWIS-6916402	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NWIS

418459 rows × 78 columns

[23]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

[23]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_30_1.png

Harmonize Characteristic Results

Two options for functions to harmonize characteristics: harmonize_all() or harmonize_generic(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize_generic runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.

[24]:

# See Documentation
#harmonize.harmonize_all?
#harmonize.harmonize?

secchi disk depth

[25]:

# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)

# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, 'Depth, Secchi disk depth', intermediate_columns=True, report=True)

-Usable results-
count    14183.000000
mean         1.202800
std          2.343619
min          0.000000
25%          0.600000
50%          1.000000
75%          1.500000
max        260.000000
dtype: float64
Unusable results: 79
Usable results with inferred units: 0
Results outside threshold (0.0 to 15.264515528912716): 1

../_images/notebooks_Harmonize_Pensacola_Detailed_35_1.png

The threshold is based on standard deviations and is currently only used in the histogram.

[26]:

# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results

[26]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Units	Secchi
83	21AWIC-7290	.94	m	NaN	m	0.94 meter
95	21FLPNS_WQX-33020JF1	0.60	m	NaN	m	0.6 meter
112	21FLGW_WQX-3565	.3	m	NaN	m	0.3 meter
122	21FLBFA_WQX-33010016	1.5	m	NaN	m	1.5 meter
125	21FLBFA_WQX-33010030	1.25	m	NaN	m	1.25 meter
...	...	...	...	...	...	...
418310	21FLESC_WQX-1004D-24Q1B	2.8	m	NaN	m	2.8 meter
418327	21FLPNS_WQX-G5NW0163	0.3	m	NaN	m	0.3 meter
418334	21FLESC_WQX-797-24Q1A	1.3	m	NaN	m	1.3 meter
418338	21FLPNS_WQX-G4NW0022	0.58	m	NaN	m	0.58 meter
418340	21FLESC_WQX-GRANDEBAYOU50	2.2	m	NaN	m	2.2 meter

14262 rows × 6 columns

[27]:

# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]

[27]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Units	Secchi
164530	21FLKWAT_WQX-OKA-CBA-GAP-3-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
164544	21FLCBA_WQX-OKA-CB-BASS-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
164872	21FLCBA_WQX-OKA-CBA-GAP-3-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
165070	21FLCBA_WQX-OKA-CB-BASS-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
166146	21FLKWAT_WQX-OKA-CB-BASS-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
...	...	...	...	...	...	...
406356	21FLKWAT_WQX-SAN-SKI WATCH-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
406469	21FLCBA_WQX-OKA-CBA-GAP-1-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
407067	21FLCBA_WQX-OKA-CB-BASS-2	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
407084	21FLCBA_WQX-OKA-CBA-GAP-3-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
407462	21FLKWAT_WQX-SAN-SKI WATCH-5	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN

79 rows × 6 columns

[28]:

# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]

[28]:

'ResultMeasureValue: "Not Reported" result cannot be used; ResultMeasure/MeasureUnitCode: MISSING UNITS, m assumed'

[29]:

# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]

[29]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Units	Secchi
164530	21FLKWAT_WQX-OKA-CBA-GAP-3-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
164544	21FLCBA_WQX-OKA-CB-BASS-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
164872	21FLCBA_WQX-OKA-CBA-GAP-3-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
165070	21FLCBA_WQX-OKA-CB-BASS-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
166146	21FLKWAT_WQX-OKA-CB-BASS-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
...	...	...	...	...	...	...
406356	21FLKWAT_WQX-SAN-SKI WATCH-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
406469	21FLCBA_WQX-OKA-CBA-GAP-1-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
407067	21FLCBA_WQX-OKA-CB-BASS-2	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
407084	21FLCBA_WQX-OKA-CBA-GAP-3-1	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN
407462	21FLKWAT_WQX-SAN-SKI WATCH-5	Not Reported	NaN	ResultMeasureValue: "Not Reported" result cann...	m	NaN

79 rows × 6 columns

If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added

[30]:

# Aggregate Secchi data by station
visualize.station_summary(sechi_results, 'Secchi')

[30]:

	MonitoringLocationIdentifier	cnt	mean
0	11NPSWRD_WQX-GUIS_CMP_PKT01	12	2.333333
1	11NPSWRD_WQX-GUIS_CMP_PKT02	17	2.411765
2	11NPSWRD_WQX-GUIS_CMP_PKT03	3	2.333333
3	21AWIC-1063	120	0.775167
4	21AWIC-1122	59	2.892102
...	...	...	...
1040	NARS_WQX-NCCA10-1432	1	1.075000
1041	NARS_WQX-NCCA10-1433	1	1.423333
1042	NARS_WQX-NCCA10-1434	1	2.400000
1043	NARS_WQX-NCCA10-1488	1	0.736667
1044	NARS_WQX-NCCA10-2432	1	1.600000

1045 rows × 3 columns

[31]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
  self.bins = quantile(y, k=k)

[31]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_43_2.png

[32]:

# Map average secchi depth results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[32]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_44_1.png

Temperature

The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])

[33]:

#'Temperature, water'
# errors=‘ignore’, invalid dimension conversions will return the NaN.
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True, errors='ignore')

-Usable results-
count    83725.000000
mean        21.956358
std         10.266391
min        -12.944444
25%         17.000000
50%         22.200000
75%         27.140000
max       1876.000000
dtype: float64
Unusable results: 2
Usable results with inferred units: 10
Results outside threshold (0.0 to 83.55470407525675): 6

../_images/notebooks_Harmonize_Pensacola_Detailed_47_1.png

[34]:

# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results

[34]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
4	21FLCBA_WQX-BAS02	78.9	deg F	NaN	26.0555555555556 degree_Celsius	degF
6	21FLPNS_WQX-33020J10	12.35	deg C	NaN	12.35 degree_Celsius	degC
21	21FLSEAS_WQX-02SEAS810	23	deg C	NaN	23.0 degree_Celsius	degC
22	21FLPNS_WQX-33010G10	11.23	deg C	NaN	11.23 degree_Celsius	degC
37	AWW_WQX-aww_2174	27	deg C	NaN	27.0 degree_Celsius	degC
...	...	...	...	...	...	...
418444	USGS-302420087140301	23.0	deg C	NaN	23.0 degree_Celsius	degC
418446	USGS-302703087133501	24.0	deg C	NaN	24.0 degree_Celsius	degC
418449	USGS-303733086441001	22.2	deg C	NaN	22.2 degree_Celsius	degC
418452	USGS-302703087133502	24.0	deg C	NaN	24.0 degree_Celsius	degC
418455	USGS-02376108	24.4	deg C	NaN	24.4 degree_Celsius	degC

83727 rows × 6 columns

In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius

[35]:

# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]

[35]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
148884	NARS_WQX-OWW04440-0401	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN	degC
223742	21FLCBA-FWB05	79.8	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	79.8 degree_Celsius	degC
223808	21FLCBA-FWB05	81.7	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	81.7 degree_Celsius	degC
224689	21FLCBA-FWB02	82.1	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	82.1 degree_Celsius	degC
224690	21FLCBA-FWB02	82.6	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	82.6 degree_Celsius	degC
224691	21FLCBA-FWB02	71.8	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	71.8 degree_Celsius	degC
224692	21FLCBA-FWB02	79.4	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	79.4 degree_Celsius	degC
230373	21FLCBA-RIV02	74.2	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	74.2 degree_Celsius	degC
230374	21FLCBA-RIV02	74.2	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	74.2 degree_Celsius	degC
235004	21FLCBA-FWB01	83.3	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	83.3 degree_Celsius	degC
235088	21FLCBA-FWB01	71.2	NaN	ResultMeasure/MeasureUnitCode: MISSING UNITS, ...	71.2 degree_Celsius	degC

We can see where the units were missing, the results were assumed to be in degree_Celsius already

[36]:

# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]

[36]:

'ResultMeasureValue: missing (NaN) result; ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'

[37]:

# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]

[37]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Temperature	Units
47436	11NPSWRD_WQX-GUIS_NALO	NaN	deg C	ResultMeasureValue: missing (NaN) result	NaN	degC
148884	NARS_WQX-OWW04440-0401	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN	degC

[38]:

# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')

[38]:

	MonitoringLocationIdentifier	cnt	mean
0	11NPSWRD_WQX-GUIS_ADEM_ALPT	30	24.986667
1	11NPSWRD_WQX-GUIS_BCCA	1	36.800000
2	11NPSWRD_WQX-GUIS_BISA	32	22.696250
3	11NPSWRD_WQX-GUIS_BOPI	1	32.000000
4	11NPSWRD_WQX-GUIS_CMP_PKT01	20	25.125000
...	...	...	...
2469	UWFCEDB_WQX-SRC-AI31-22	10	23.020000
2470	UWFCEDB_WQX-SRC-AI36-22	10	23.230000
2471	UWFCEDB_WQX-SRC-AI42-22	10	22.650000
2472	UWFCEDB_WQX-SRC-AI44-22	10	22.890000
2473	UWFCEDB_WQX-SRC-AK41-22	8	21.462500

2474 rows × 3 columns

[39]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

[39]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_55_1.png

[40]:

# Map average temperature results at each station
gdf_temperature = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)

[40]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_56_1.png

Dissolved oxygen

[41]:

# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')

Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.

[42]:

# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res

[42]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	DO
1	21FLSEAS_WQX-02SEAS740	6.3	mg/L	NaN	6.3 milligram / liter
5	21FLCMP_WQX-3201BM21	4.5	mg/L	NaN	4.5 milligram / liter
12	21FLPNS_WQX-33030D71	6.64	mg/L	NaN	6.64 milligram / liter
16	21FLBFA_WQX-33020057	1.17	mg/L	NaN	1.17 milligram / liter
25	21FLNUTT_WQX-PB02	8.11	mg/L	NaN	8.11 milligram / liter
...	...	...	...	...	...
418319	21AWIC-1122	8.261	mg/L	NaN	8.261 milligram / liter
418324	21FLESC_WQX-MARCUSCRKW20	8.71	mg/L	NaN	8.71 milligram / liter
418332	21FLESC_WQX-CRESCENTLAKE10	9.11	mg/L	NaN	9.11 milligram / liter
418342	21FLPNS_WQX-G5NW0070	3.73	mg/L	NaN	3.73 milligram / liter
418349	21FLPNS_WQX-G4NW0022	9.28	mg/L	NaN	9.28 milligram / liter

66208 rows × 5 columns

[43]:

do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']

[43]:

	MonitoringLocationIdentifier	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	DO
1	21FLSEAS_WQX-02SEAS740	6.3	mg/L	NaN	6.3 milligram / liter
5	21FLCMP_WQX-3201BM21	4.5	mg/L	NaN	4.5 milligram / liter
12	21FLPNS_WQX-33030D71	6.64	mg/L	NaN	6.64 milligram / liter
16	21FLBFA_WQX-33020057	1.17	mg/L	NaN	1.17 milligram / liter
25	21FLNUTT_WQX-PB02	8.11	mg/L	NaN	8.11 milligram / liter
...	...	...	...	...	...
418319	21AWIC-1122	8.261	mg/L	NaN	8.261 milligram / liter
418324	21FLESC_WQX-MARCUSCRKW20	8.71	mg/L	NaN	8.71 milligram / liter
418332	21FLESC_WQX-CRESCENTLAKE10	9.11	mg/L	NaN	9.11 milligram / liter
418342	21FLPNS_WQX-G5NW0070	3.73	mg/L	NaN	3.73 milligram / liter
418349	21FLPNS_WQX-G4NW0022	9.28	mg/L	NaN	9.28 milligram / liter

43130 rows × 5 columns

Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())

[44]:

# Aggregate DO data by station
visualize.station_summary(do_res, 'DO')

[44]:

	MonitoringLocationIdentifier	cnt	mean
0	11NPSWRD_WQX-GUIS_ADEM_ALPT	30	6.698000
1	11NPSWRD_WQX-GUIS_BCCA	1	0.270000
2	11NPSWRD_WQX-GUIS_BISA	32	7.194375
3	11NPSWRD_WQX-GUIS_BOPI	1	7.540000
4	11NPSWRD_WQX-GUIS_FPPO	1	9.950000
...	...	...	...
1999	UWFCEDB_WQX-SRC-AI31-22	20	3.723390
2000	UWFCEDB_WQX-SRC-AI36-22	20	3.513705
2001	UWFCEDB_WQX-SRC-AI42-22	20	3.677337
2002	UWFCEDB_WQX-SRC-AI44-22	20	3.658370
2003	UWFCEDB_WQX-SRC-AK41-22	16	2.706512

2004 rows × 3 columns

[45]:

# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)

[45]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_64_1.png

[46]:

# Map Averages at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[46]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_65_1.png

pH

[47]:

# pH, this time looking at a report
df = harmonize.harmonize(df, 'pH', report=True)

-Usable results-
count    57414.000000
mean         7.341551
std          0.904250
min          0.500000
25%          6.860000
50%          7.680000
75%          8.000000
max         16.200000
dtype: float64
Unusable results: 51
Usable results with inferred units: 36
Results outside threshold (0.0 to 12.767052173396209): 1

../_images/notebooks_Harmonize_Pensacola_Detailed_67_1.png

Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.

[48]:

df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]

[48]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	pH
2	7.29	None	NaN	7.29 dimensionless
19	8.09	None	NaN	8.09 dimensionless
24	7.45	None	NaN	7.45 dimensionless
27	6.57	None	NaN	6.57 dimensionless
29	6.57	None	NaN	6.57 dimensionless
...	...	...	...	...
418453	5.9	std units	NaN	5.9 dimensionless
418454	7.9	std units	NaN	7.9 dimensionless
418456	8.3	std units	NaN	8.3 dimensionless
418457	6.6	std units	NaN	6.6 dimensionless
418458	8.1	std units	NaN	8.1 dimensionless

57465 rows × 4 columns

‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless

Salinity

[49]:

# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True)

-Usable results-
count    71658.000000
mean        15.999914
std        152.600750
min          0.000000
25%          6.000000
50%         16.200000
75%         23.400000
max      37782.000000
dtype: float64
Unusable results: 417
Usable results with inferred units: 10
Results outside threshold (0.0 to 931.6044110771666): 4

../_images/notebooks_Harmonize_Pensacola_Detailed_72_1.png

[50]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]

[50]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity
9	18.9	ppth	NaN	18.9 Practical_Salinity_Units
11	11.82	ppth	NaN	11.82 Practical_Salinity_Units
17	.03	ppt	NaN	0.03 Practical_Salinity_Units
23	33.27	ppth	NaN	33.27 Practical_Salinity_Units
28	25.16	ppth	NaN	25.16 Practical_Salinity_Units
...	...	...	...	...
418348	0	PSS	NaN	0.0 Practical_Salinity_Units
418350	.02	ppt	NaN	0.02 Practical_Salinity_Units
418357	0.01	ppth	NaN	0.01 Practical_Salinity_Units
418358	23.81	ppth	NaN	23.81 Practical_Salinity_Units
418360	17.1252	ppt	NaN	17.1252 Practical_Salinity_Units

72075 rows × 4 columns

Nitrogen

[51]:

# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'cm3/g' UNDEFINED UNIT for Nitrogen
  warn("WARNING: " + problem)

-Usable results-
count     109.000000
mean       26.920174
std       160.257726
min         0.000700
25%         0.410000
50%         0.629000
75%         1.120000
max      1630.000000
dtype: float64
Unusable results: 4
Usable results with inferred units: 0
Results outside threshold (0.0 to 988.466532186079): 1

../_images/notebooks_Harmonize_Pensacola_Detailed_75_2.png

[52]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]

[52]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Nitrogen
43096	0.3	mg/L	NaN	0.3 milligram / liter
43352	0.36	mg/L	NaN	0.36 milligram / liter
43447	0.33875	mg/L	NaN	0.33875 milligram / liter
43539	0.53125	mg/L	NaN	0.53125 milligram / liter
44119	135	mg/kg	NaN	135.00000000000003 milligram / liter
...	...	...	...	...
415651	18.69	mg/l	NaN	18.69 milligram / liter
415658	16.18	mg/l	NaN	16.18 milligram / liter
415659	18.99	mg/l	NaN	18.99 milligram / liter
415662	18.72	mg/l	NaN	18.72 milligram / liter
415664	17.61	mg/l	NaN	17.61 milligram / liter

113 rows × 4 columns

Conductivity

[53]:

# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True)

-Usable results-
count     1818.000000
mean     17085.221414
std      16116.889030
min          0.040000
25%        130.000000
50%      16994.750000
75%      30306.650000
max      54886.200000
dtype: float64
Unusable results: 8
Usable results with inferred units: 0
Results outside threshold (0.0 to 113786.55559242623): 0

../_images/notebooks_Harmonize_Pensacola_Detailed_78_1.png

[54]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]

[54]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Conductivity
13	19204.2	umho/cm	NaN	19204.2 microsiemens / centimeter
92	222.3	umho/cm	NaN	222.3 microsiemens / centimeter
189	102.8	umho/cm	NaN	102.8 microsiemens / centimeter
379	11017.5	umho/cm	NaN	11017.5 microsiemens / centimeter
793	32	umho/cm	NaN	32.0 microsiemens / centimeter
...	...	...	...	...
415851	110	umho/cm	NaN	110.0 microsiemens / centimeter
415856	65	umho/cm	NaN	65.0 microsiemens / centimeter
415858	110	umho/cm	NaN	110.0 microsiemens / centimeter
415861	390	umho/cm	NaN	390.0 microsiemens / centimeter
415864	65	umho/cm	NaN	65.0 microsiemens / centimeter

1826 rows × 4 columns

Chlorophyll a

[55]:

# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'None' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/cm2' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)

-Usable results-
count    9266.000000
mean        1.169613
std         1.200140
min        -0.840000
25%         0.008300
50%         0.970000
75%         1.850000
max         9.990000
dtype: float64
Unusable results: 589
Usable results with inferred units: 0
Results outside threshold (0.0 to 8.370452978777962): 8

../_images/notebooks_Harmonize_Pensacola_Detailed_81_2.png

[56]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]

[56]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Chlorophyll
244	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
547	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
660	2.3	mg/m3	NaN	0.0023000000000000004 milligram / liter
1207	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
1350	2.9	mg/m3	NaN	0.0029000000000000007 milligram / liter
...	...	...	...	...
417283	6.3	mg/m3	NaN	0.006300000000000001 milligram / liter
417307	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN
417407	1.1	mg/m3	NaN	0.0011000000000000003 milligram / liter
417987	4.4	mg/m3	NaN	0.004400000000000001 milligram / liter
418017	NaN	NaN	ResultMeasureValue: missing (NaN) result; Resu...	NaN

9855 rows × 4 columns

Organic Carbon

[57]:

# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True)

-Usable results-
count      4820.000000
mean       1139.767819
std       11597.098499
min           0.000000
25%           2.700000
50%           4.400000
75%           8.300000
max      410000.000000
dtype: float64
Unusable results: 165
Usable results with inferred units: 0
Results outside threshold (0.0 to 70722.35881310055): 22

../_images/notebooks_Harmonize_Pensacola_Detailed_84_1.png

[58]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]

[58]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Carbon
121	2.6	mg/L	NaN	2.6 milligram / liter
156	3.9	mg/L	NaN	3.9 milligram / liter
206	5.2	mg/L	NaN	5.2 milligram / liter
309	4.8	mg/L	NaN	4.8 milligram / liter
338	1.0	mg/L	NaN	1.0 milligram / liter
...	...	...	...	...
418239	10	mg/L	NaN	10.0 milligram / liter
418288	14	mg/L	NaN	14.0 milligram / liter
418300	3.2	mg/L	NaN	3.2 milligram / liter
418313	3.1	mg/L	NaN	3.1 milligram / liter
418330	4.297	mg/L	NaN	4.297 milligram / liter

4985 rows × 4 columns

Turbidity

[59]:

# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True)

-Usable results-
count    40718.000000
mean         8.926798
std        203.564637
min         -0.840000
25%          1.420000
50%          2.590000
75%          4.900000
max      32342.452300
dtype: float64
Unusable results: 596
Usable results with inferred units: 10
Results outside threshold (0.0 to 1230.3146187462137): 45

../_images/notebooks_Harmonize_Pensacola_Detailed_87_1.png

[60]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]

[60]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Turbidity
26	0	NTU	NaN	0.0 Nephelometric_Turbidity_Units
45	1.4	NTU	NaN	1.4 Nephelometric_Turbidity_Units
50	4.7	NTU	NaN	4.7 Nephelometric_Turbidity_Units
101	5.6	NTU	NaN	5.6 Nephelometric_Turbidity_Units
103	28	NTU	NaN	28.0 Nephelometric_Turbidity_Units
...	...	...	...	...
418351	0.73	NTU	NaN	0.73 Nephelometric_Turbidity_Units
418362	4.7	NTU	NaN	4.7 Nephelometric_Turbidity_Units
418374	0.0	JTU	NaN	-0.0477 Nephelometric_Turbidity_Units
418393	50	JTU	NaN	951.2022999999999 Nephelometric_Turbidity_Units
418401	1700	JTU	NaN	32342.452299999997 Nephelometric_Turbidity_Units

41314 rows × 4 columns

Sediment

[61]:

# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False)

[62]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]

[62]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Sediment

Phosphorus

Note: must be merged w/ activities (package runs query by site if not already merged)

[63]:

# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus')

2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"

Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.

[64]:

# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]

[64]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
35	.061	mg/L	NaN	NaN
66	0.03	mg/L	NaN	NaN
151	.13	mg/L	NaN	NaN
176	0.003	mg/L	NaN	NaN
315	0.002	mg/L	NaN	NaN
...	...	...	...	...
418214	.107	mg/L	NaN	NaN
418271	0.02	mg/L	NaN	NaN
418274	0.004	mg/L	NaN	NaN
418293	0.005	mg/L	NaN	NaN
418344	0.02	mg/L	NaN	NaN

6799 rows × 4 columns

[65]:

# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]

[65]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
35	.061	mg/L	NaN	NaN
66	0.03	mg/L	NaN	NaN
151	.13	mg/L	NaN	NaN
176	0.003	mg/L	NaN	NaN
315	0.002	mg/L	NaN	NaN
...	...	...	...	...
418214	.107	mg/L	NaN	NaN
418271	0.02	mg/L	NaN	NaN
418274	0.004	mg/L	NaN	NaN
418293	0.005	mg/L	NaN	NaN
418344	0.02	mg/L	NaN	NaN

6069 rows × 4 columns

[66]:

# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]

[66]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
3684	0.019	mg/L	NaN	0.019 milligram / liter
8180	0.002	mg/L	NaN	0.002 milligram / liter
23210	0.003	mg/L	NaN	0.003 milligram / liter
26305	0.019	mg/L	NaN	0.019 milligram / liter
56249	0.002	mg/L	NaN	0.002 milligram / liter
58177	0.017	mg/L	NaN	0.017 milligram / liter
91512	0.021	mg/L	NaN	0.021 milligram / liter
94313	0.003	mg/L	NaN	0.003 milligram / liter
101207	0.020	mg/L	NaN	0.02 milligram / liter
107151	0.002	mg/L	NaN	0.002 milligram / liter
148518	0.00806	mg/L	NaN	0.00806 milligram / liter
151929	0.000031	mg/L	NaN	3.1e-05 milligram / liter
153142	0.002542	mg/L	NaN	0.002542 milligram / liter
153726	0.00341	mg/L	NaN	0.00341 milligram / liter
205233	0.00372	mg/L	NaN	0.00372 milligram / liter
207271	0.00961	mg/L	NaN	0.00961 milligram / liter
208186	0.00124	mg/L	NaN	0.00124 milligram / liter
209016	0.01271	mg/L	NaN	0.01271 milligram / liter
414049	0.030	mg/l as P	NaN	0.03 milligram / liter
414058	0.033	mg/l as P	NaN	0.033 milligram / liter
414061	0.024	mg/l as P	NaN	0.024 milligram / liter
414067	0.028	mg/l as P	NaN	0.028 milligram / liter
414076	0.021	mg/l as P	NaN	0.021 milligram / liter
414081	0.023	mg/l as P	NaN	0.023 milligram / liter
414093	0.037	mg/l as P	NaN	0.037 milligram / liter
414158	0.023	mg/l as P	NaN	0.023 milligram / liter
414166	0.02	mg/l as P	NaN	0.02 milligram / liter
414182	0.04	mg/l as P	NaN	0.04 milligram / liter
414197	0.03	mg/l as P	NaN	0.03 milligram / liter
414206	0.025	mg/l as P	NaN	0.025 milligram / liter
414233	0.05	mg/l as P	NaN	0.05 milligram / liter
414242	0.15	mg/l as P	NaN	0.15 milligram / liter
414262	0.03	mg/l as P	NaN	0.03 milligram / liter
414379	0.02	mg/l as P	NaN	0.02 milligram / liter
414401	0.07	mg/l as P	NaN	0.07 milligram / liter
414409	0.08	mg/l as P	NaN	0.08 milligram / liter
414424	0.02	mg/l as P	NaN	0.02 milligram / liter
414444	0.02	mg/l as P	NaN	0.02 milligram / liter
414456	0.04	mg/l as P	NaN	0.04 milligram / liter
414476	0.02	mg/l as P	NaN	0.02 milligram / liter
414489	0.05	mg/l as P	NaN	0.05 milligram / liter
415374	0.03	mg/l as P	NaN	0.03 milligram / liter
415381	0.05	mg/l as P	NaN	0.05 milligram / liter

[67]:

# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]

[67]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	TDP_Phosphorus
10359	.5	mg/L	NaN	NaN
10435	.036	mg/L	NaN	NaN
11767	.089	mg/L	NaN	NaN
12727	.017	mg/L	NaN	NaN
13836	.035	mg/L	NaN	NaN
...	...	...	...	...
415820	.18	mg/L	NaN	NaN
415829	.25	mg/L	NaN	NaN
415843	.16	mg/L	NaN	NaN
415846	.18	mg/L	NaN	NaN
415855	.31	mg/L	NaN	NaN

687 rows × 4 columns

Bacteria

Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.

Fecal Coliform

[68]:

# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count    10035.000000
mean        45.537618
std        448.839329
min          0.000000
25%          4.000000
50%          8.000000
75%         33.000000
max      33000.000000
dtype: float64
Unusable results: 40571
Usable results with inferred units: 0
Results outside threshold (0.0 to 2738.573594138782): 6

../_images/notebooks_Harmonize_Pensacola_Detailed_103_2.png

[69]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]

[69]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Fecal_Coliform
0	*Non-detect	NaN	ResultMeasureValue: "*Non-detect" result canno...	NaN
3	80	cfu/100mL	NaN	NaN
7	*Non-detect	NaN	ResultMeasureValue: "*Non-detect" result canno...	NaN
8	2	MPN/100mL	NaN	NaN
10	*Non-detect	NaN	ResultMeasureValue: "*Non-detect" result canno...	NaN
...	...	...	...	...
415706	245	cfu/100mL	NaN	NaN
417067	260	cfu/100mL	NaN	NaN
417221	120	cfu/100mL	NaN	NaN
417281	11	cfu/100mL	NaN	NaN
417464	58	cfu/100mL	NaN	NaN

50606 rows × 4 columns

Escherichia coli

[70]:

# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")

-Usable results-
count      22.000000
mean      501.863636
std       610.053260
min         4.000000
25%         9.500000
50%        77.500000
75%      1000.000000
max      1700.000000
dtype: float64
Unusable results: 8930
Usable results with inferred units: 0
Results outside threshold (0.0 to 4162.183198738116): 0

../_images/notebooks_Harmonize_Pensacola_Detailed_106_2.png

[71]:

cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]

[71]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	E_coli
20	0	cfu/100mL	NaN	NaN
63	1000	cfu/100mL	NaN	NaN
69	33.3333333333333	cfu/100mL	NaN	NaN
82	0	cfu/100mL	NaN	NaN
136	0	cfu/100mL	NaN	NaN
...	...	...	...	...
418259	231	MPN/100mL	NaN	NaN
418297	85	MPN/100mL	NaN	NaN
418343	171	MPN/100mL	NaN	NaN
418353	30	MPN/100mL	NaN	NaN
418361	31	MPN/100mL	NaN	NaN

8952 rows × 4 columns

Combining Salinity and Conductivity

Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).

Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.

Explore Salinity results:

[72]:

from harmonize_wq import convert

[73]:

# Salinity summary statistics
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))

Range: 0.0 to 37782.0
Results: 71658
Mean: 15.999914010997138 PSU

[74]:

# Identify extreme outliers
[x for x in lst if x >3200]

[74]:

[15030.0, 37782.0]

Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration

[75]:

# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']

[76]:

# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]

[76]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity
21771	15030	ppt	NaN	15030.0 Practical_Salinity_Units
31254	322	ppth	NaN	322.0 Practical_Salinity_Units
42625	2150	ppth	NaN	2150.0 Practical_Salinity_Units
69596	37782	ppth	NaN	37782.0 Practical_Salinity_Units
143785	2190	ppt	NaN	2190.0 Practical_Salinity_Units

Detection limits may help understand what caused low values and what results might need to be dropped or updated

[77]:

from harmonize_wq import wrangle

[78]:

df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
       'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode']

[79]:

# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]

[79]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity	ResultDetectionConditionText	DetectionQuantitationLimitTypeName	DetectionQuantitationLimitMeasure/MeasureValue	DetectionQuantitationLimitMeasure/MeasureUnitCode
2168	0.00	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
3808	0.00	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
18985	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
27307	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
32679	0.0	ppth	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
418176	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
418180	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
418191	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
418318	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN
418348	0	PSS	NaN	0.0 Practical_Salinity_Units	NaN	NaN	NaN	NaN

2909 rows × 8 columns

Explore Conductivity results:

[80]:

# Create series and inspect Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series

[80]:

13        19204.2 microsiemens / centimeter
92          222.3 microsiemens / centimeter
189         102.8 microsiemens / centimeter
379       11017.5 microsiemens / centimeter
793          32.0 microsiemens / centimeter
                        ...
415851      110.0 microsiemens / centimeter
415856       65.0 microsiemens / centimeter
415858      110.0 microsiemens / centimeter
415861      390.0 microsiemens / centimeter
415864       65.0 microsiemens / centimeter
Name: Conductivity, Length: 1818, dtype: object

Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters

[81]:

# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]

[81]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity	Conductivity
115517	54886.2	umho/cm	NaN	NaN	54886.2 microsiemens / centimeter
118761	54871.3	umho/cm	NaN	NaN	54871.3 microsiemens / centimeter
112192	54860.6	umho/cm	NaN	NaN	54860.6 microsiemens / centimeter
117526	54859.3	umho/cm	NaN	NaN	54859.3 microsiemens / centimeter
111578	54850.8	umho/cm	NaN	NaN	54850.8 microsiemens / centimeter
...	...	...	...	...	...
76805	6.8	umho/cm	NaN	NaN	6.8 microsiemens / centimeter
51581	2	umho/cm	NaN	NaN	2.0 microsiemens / centimeter
143904	2	umho/cm	NaN	NaN	2.0 microsiemens / centimeter
36860	1	umho/cm	NaN	NaN	1.0 microsiemens / centimeter
145111	.04	umho/cm	NaN	NaN	0.04 microsiemens / centimeter

1818 rows × 5 columns

[82]:

# Check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.loc[df['Conductivity'].notna(), cols]

[82]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	QA_flag	Salinity	Conductivity
115517	54886.2	umho/cm	NaN	NaN	54886.2 microsiemens / centimeter
118761	54871.3	umho/cm	NaN	NaN	54871.3 microsiemens / centimeter
112192	54860.6	umho/cm	NaN	NaN	54860.6 microsiemens / centimeter
117526	54859.3	umho/cm	NaN	NaN	54859.3 microsiemens / centimeter
111578	54850.8	umho/cm	NaN	NaN	54850.8 microsiemens / centimeter
...	...	...	...	...	...
76805	6.8	umho/cm	NaN	NaN	6.8 microsiemens / centimeter
51581	2	umho/cm	NaN	NaN	2.0 microsiemens / centimeter
143904	2	umho/cm	NaN	NaN	2.0 microsiemens / centimeter
36860	1	umho/cm	NaN	NaN	1.0 microsiemens / centimeter
145111	.04	umho/cm	NaN	NaN	0.04 microsiemens / centimeter

1818 rows × 5 columns

[83]:

# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str)  # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']

[83]:

115517    36.356 dimensionless
118761    36.345 dimensionless
112192    36.338 dimensionless
117526    36.336 dimensionless
111578     36.33 dimensionless
                  ...
76805      0.013 dimensionless
51581      0.012 dimensionless
143904     0.012 dimensionless
36860      0.012 dimensionless
145111     0.012 dimensionless
Name: Salinity, Length: 1818, dtype: object

Datetime

datetime() formats time using dataretrieval and ActivityStart

[84]:

# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]

[84]:

	ActivityStartDate	ActivityStartTime/Time	ActivityStartTime/TimeZoneCode
115517	2007-08-09	12:15:00	CST
118761	2007-08-09	12:15:00	CST
112192	2007-08-09	12:15:00	CST
117526	2007-08-09	12:15:00	CST
111578	2007-08-09	12:15:00	CST
...	...	...	...
418454	1959-02-04	NaN	NaN
418455	1958-01-14	09:20:00	CST
418456	1958-01-14	09:20:00	CST
418457	1959-02-05	NaN	NaN
418458	1959-02-02	NaN	NaN

418459 rows × 3 columns

[85]:

# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 35999 incomplete dates found, consider setting datetime_index to False.
  warnings.warn(

[85]:

	ActivityStartDate	Activity_datetime
115517	2007-08-09	2007-08-09 18:15:00+00:00
118761	2007-08-09	2007-08-09 18:15:00+00:00
112192	2007-08-09	2007-08-09 18:15:00+00:00
117526	2007-08-09	2007-08-09 18:15:00+00:00
111578	2007-08-09	2007-08-09 18:15:00+00:00
...	...	...
418454	1959-02-04	NaT
418455	1958-01-14	1958-01-14 15:20:00+00:00
418456	1958-01-14	1958-01-14 15:20:00+00:00
418457	1959-02-05	NaT
418458	1959-02-02	NaT

418459 rows × 2 columns

Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.

Depth

Note: Data are often lacking sample depth metadata

[86]:

# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()

[86]:

1588       7.0
3668       7.0
78393      1.0
78895     16.0
79291     16.0
          ...
100784     1.3
100788     0.5
100823     2.0
100880     2.2
143344    35.0
Name: ResultDepthHeightMeasure/MeasureValue, Length: 179, dtype: float64

Characteristic to Column (long to wide format)

[87]:

# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2

[87]:

	OrganizationIdentifier	OrganizationFormalName	ActivityIdentifier	ActivityStartDate	ActivityStartTime/Time	ActivityStartTime/TimeZoneCode	MonitoringLocationIdentifier	ResultIdentifier	DataLoggerLine	ResultDetectionConditionText	...	QA_Carbon	QA_Salinity	QA_TP_Phosphorus	QA_TDP_Phosphorus	QA_Other_Phosphorus	QA_DO	QA_Chlorophyll	QA_Nitrogen	QA_Temperature	QA_E_coli
115517	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230231_173	2007-08-09	12:15:00	-0600	21AWIC-1122	STORET-170383613	230231.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
118761	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230230_173	2007-08-09	12:15:00	-0600	21AWIC-1122	STORET-170383607	230230.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
112192	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230228_173	2007-08-09	12:15:00	-0600	21AWIC-1122	STORET-170383595	230228.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
117526	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230229_173	2007-08-09	12:15:00	-0600	21AWIC-1122	STORET-170383601	230229.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
111578	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230227_173	2007-08-09	12:15:00	-0600	21AWIC-1122	STORET-170383589	230227.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
418454	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95900924	1959-02-04	NaN	NaN	USGS-303745086442101	NWIS-98928104	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418455	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95800572	1958-01-14	09:20:00	-0600	USGS-02376108	NWIS-6891392	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418456	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95800572	1958-01-14	09:20:00	-0600	USGS-02376108	NWIS-6891396	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418457	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95900926	1959-02-05	NaN	NaN	USGS-303820086241802	NWIS-6916678	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
418458	USGS-FL	USGS Florida Water Science Center	nwisfl.01.95900852	1959-02-02	NaN	NaN	USGS-302330086482001	NWIS-6916402	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

366769 rows × 117 columns

[88]:

# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))

14 new columns

[89]:

# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))

51690 fewer rows

[90]:

#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]

[90]:

	ResultMeasureValue	ResultMeasure/MeasureUnitCode	Carbon	QA_Carbon

Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)

[91]:

# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)

[92]:

# Columns still in main table
main_df.columns

[92]:

Index(['OrganizationIdentifier', 'OrganizationFormalName',
       'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
       'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
       'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
       'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
       'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
       'E_coli', 'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode',
       'Activity_datetime', 'Depth', 'QA_Fecal_Coliform', 'QA_Turbidity',
       'QA_pH', 'QA_Secchi', 'QA_Conductivity', 'QA_Carbon', 'QA_Salinity',
       'QA_TP_Phosphorus', 'QA_TDP_Phosphorus', 'QA_Other_Phosphorus', 'QA_DO',
       'QA_Chlorophyll', 'QA_Nitrogen', 'QA_Temperature', 'QA_E_coli'],
      dtype='object')

[93]:

# look at main table results (first 5)
main_df.head()

[93]:

	OrganizationIdentifier	OrganizationFormalName	ActivityIdentifier	MonitoringLocationIdentifier	ProviderName	Secchi	Temperature	DO	pH	Salinity	...	QA_Carbon	QA_Salinity	QA_TP_Phosphorus	QA_TDP_Phosphorus	QA_Other_Phosphorus	QA_DO	QA_Chlorophyll	QA_Nitrogen	QA_Temperature	QA_E_coli
115517	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230231_173	21AWIC-1122	STORET	NaN	NaN	NaN	NaN	36.356 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
118761	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230230_173	21AWIC-1122	STORET	NaN	NaN	NaN	NaN	36.345 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
112192	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230228_173	21AWIC-1122	STORET	NaN	NaN	NaN	NaN	36.338 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
117526	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230229_173	21AWIC-1122	STORET	NaN	NaN	NaN	NaN	36.336 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
111578	21AWIC	ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA...	21AWIC-51908_230227_173	21AWIC-1122	STORET	NaN	NaN	NaN	NaN	36.33 dimensionless	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 44 columns

[94]:

# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]

[94]:

['Sediment',
 'QA_Fecal_Coliform',
 'QA_Secchi',
 'QA_Conductivity',
 'QA_Carbon',
 'QA_TP_Phosphorus',
 'QA_TDP_Phosphorus',
 'QA_Other_Phosphorus',
 'QA_E_coli']

[95]:

# Map average results at each station
gdf_avg = visualize.map_measure(main_df, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)

[95]:

<Axes: >

../_images/notebooks_Harmonize_Pensacola_Detailed_146_1.png