Pensacola Bay FL - Detailed step-by-step

Standardize, clean and wrangle Water Quality Portal data in Pensacola and Perdido Bays into more analytic-ready formats using the harmonize_wq package

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:

  • Identify differences in data units (including speciation and basis)

  • Identify differences in sampling or analytic methods

  • Resolve data errors using transparent assumptions

  • Reduce data to the columns that are most commonly needed

  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

Detailed step-by-step workflow

This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Pensacola and Perdido Bays

Install and import the required libraries

[1]:
import sys
#!python -m pip uninstall harmonize-wq --yes
#!python -m pip install harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8
[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
  warnings.warn('GDAL not installed. Some functions will not work.')

Download location data using dataretrieval

[3]:
# Read geometry for Area of Interest from geojson file url and plot
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
aoi_gdf = wrangle.as_gdf(aoi_url).to_crs(epsg=4326)  # already standard 4326
aoi_gdf.plot()
[3]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_9_1.png
[4]:
# Note there are actually two polygons (one for each Bay)
aoi_gdf
# Spatial query parameters can be updated to run just one
bBox = wrangle.get_bounding_box(aoi_gdf)
# For only one bay, e.g., first is Pensacola Bay:
#bBox = wrangle.get_bounding_box(aoi_gdf, 0)
[5]:
# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
                                'Temperature, water',
                                'Depth, Secchi disk depth',
                                'Dissolved oxygen (DO)',
                                'Salinity',
                                'pH',
                                'Nitrogen',
                                'Conductivity',
                                'Organic carbon',
                                'Chlorophyll a',
                                'Turbidity',
                                'Sediment',
                                'Fecal Coliform',
                                'Escherichia coli']}
query['bBox'] = bBox
[6]:
# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)
[7]:
# Rows and columns for results
stations.shape
[7]:
(2863, 37)
[8]:
# First 5 rows
stations.head()
[8]:
OrganizationIdentifier OrganizationFormalName MonitoringLocationIdentifier MonitoringLocationName MonitoringLocationTypeName MonitoringLocationDescriptionText HUCEightDigitCode DrainageAreaMeasure/MeasureValue DrainageAreaMeasure/MeasureUnitCode ContributingDrainageAreaMeasure/MeasureValue ... AquiferName LocalAqfrName FormationTypeText AquiferTypeName ConstructionDateText WellDepthMeasure/MeasureValue WellDepthMeasure/MeasureUnitCode WellHoleDepthMeasure/MeasureValue WellHoleDepthMeasure/MeasureUnitCode ProviderName
0 USGS-AL USGS Alabama Water Science Center USGS-02376115 ELEVENMILE CREEK NR WEST PENSACOLA, FL Stream NaN 3140107.0 27.8 sq mi 27.8 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
1 USGS-AL USGS Alabama Water Science Center USGS-02377570 STYX RIVER NEAR ELSANOR, AL. Stream NaN 3140106.0 192.0 sq mi 192.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
2 USGS-AL USGS Alabama Water Science Center USGS-02377920 BLACKWATER RIVER AT US HWY 90 NR ROBERTSDALE, AL. Stream NaN 3140106.0 23.1 sq mi 23.1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
3 USGS-AL USGS Alabama Water Science Center USGS-02377960 BLACKWATER RIVER AT CO RD 87 NEAR ELSANOR, AL. Stream NaN 3140106.0 56.6 sq mi 56.6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
4 USGS-AL USGS Alabama Water Science Center USGS-02377975 BLACKWATER RIVER ABOVE SEMINOLE AL Stream NaN 3140106.0 40.2 sq mi NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS

5 rows × 37 columns

[9]:
# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]
[9]:
HorizontalCoordinateReferenceSystemDatumName        NAD83
LatitudeMeasure                                 30.498252
LongitudeMeasure                               -87.335809
Name: 0, dtype: object
[10]:
# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, out_EPSG=4326, intermediate_columns=True)
[11]:
location.harmonize_locations?
[12]:
# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape
[12]:
(2863, 42)
[13]:
# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]
[13]:
geom_orig         (-87.3358086, 30.49825159)
EPSG                                  4269.0
QA_flag                                  NaN
geom         POINT (-87.3358086 30.49825159)
geometry     POINT (-87.3358086 30.49825159)
Name: 0, dtype: object
[14]:
# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype
[14]:
<geopandas.array.GeometryDtype at 0x7f9be0376910>
[15]:
# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])
[15]:
{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
 'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
 'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits',
 'LongitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed'}
[16]:
# Map it
stations_gdf.plot()
[16]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_22_1.png
[17]:
# Clip to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)
[18]:
# Map it
stations_clipped.plot()
[18]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_24_1.png
[19]:
# How many stations now?
len(stations_clipped)
[19]:
1419
[20]:
# To save the results to a shapefile
#import os
#path = ''  #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path, 'PPBEP_stations.shp'))

Retrieve Characteristic Data

[21]:
# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (10,13,15,17,19,20,21,22,23,28,31,33,34,36,58,60,61,64,65,69,70,71,72,73) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(StringIO(response.text), delimiter=',')
[22]:
df = res_narrow
df
[22]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... AnalysisEndTime/TimeZoneCode ResultLaboratoryCommentCode ResultLaboratoryCommentText ResultDetectionQuantitationLimitUrl LaboratoryAccreditationIndicator LaboratoryAccreditationAuthorityName TaxonomistAccreditationIndicator TaxonomistAccreditationAuthorityName LabSamplePreparationUrl ProviderName
0 21FLSEAS_WQX Florida Department of Environmental Protection 21FLSEAS_WQX-027950424132 2013-04-24 09:01:00 EST 21FLSEAS_WQX-02SEAS795 STORET-310551339 NaN NaN ... NaN NaN NaN https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET
1 21FLSEAS_WQX Florida Department of Environmental Protection 21FLSEAS_WQX-027400613134 2013-06-13 10:01:00 EST 21FLSEAS_WQX-02SEAS740 STORET-310489836 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
2 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1536988F1 2013-09-17 11:01:00 EST 21FLPNS_WQX-33030019 STORET-308146602 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
3 21FLPNS_WQX FL Dept. of Environmental Protection, Northwes... 21FLPNS_WQX-1520810L 2013-07-23 10:01:00 EST 21FLPNS_WQX-33020146 STORET-308157908 NaN NaN ... NaN NaN NaN https://www.waterqualitydata.us/data/providers... NaN NaN NaN NaN NaN STORET
4 21FLCBA_WQX CHOCTAWHATCHEE BASIN ALLIANCE 21FLCBA_WQX-BAS219848-162813 2013-09-23 17:15:00 CST 21FLCBA_WQX-BAS02 STORET-760593202 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN STORET
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
418454 USGS-FL USGS Florida Water Science Center nwisfl.01.95900924 1959-02-04 NaN NaN USGS-303745086442101 NWIS-98928104 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
418455 USGS-FL USGS Florida Water Science Center nwisfl.01.95800572 1958-01-14 09:20:00 CST USGS-02376108 NWIS-6891392 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
418456 USGS-FL USGS Florida Water Science Center nwisfl.01.95800572 1958-01-14 09:20:00 CST USGS-02376108 NWIS-6891396 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
418457 USGS-FL USGS Florida Water Science Center nwisfl.01.95900926 1959-02-05 NaN NaN USGS-303820086241802 NWIS-6916678 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS
418458 USGS-FL USGS Florida Water Science Center nwisfl.01.95900852 1959-02-02 NaN NaN USGS-302330086482001 NWIS-6916402 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NWIS

418459 rows × 78 columns

[23]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[23]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_30_1.png

Harmonize Characteristic Results

Two options for functions to harmonize characteristics: harmonize_all() or harmonize_generic(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize_generic runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.

[24]:
# See Documentation
#harmonize.harmonize_all?
#harmonize.harmonize?
secchi disk depth
[25]:
# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)

# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, 'Depth, Secchi disk depth', intermediate_columns=True, report=True)
-Usable results-
count    14183.000000
mean         1.202800
std          2.343619
min          0.000000
25%          0.600000
50%          1.000000
75%          1.500000
max        260.000000
dtype: float64
Unusable results: 79
Usable results with inferred units: 0
Results outside threshold (0.0 to 15.264515528912716): 1
../_images/notebooks_Harmonize_Pensacola_Detailed_35_1.png

The threshold is based on standard deviations and is currently only used in the histogram.

[26]:
# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results
[26]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
83 21AWIC-7290 .94 m NaN m 0.94 meter
95 21FLPNS_WQX-33020JF1 0.60 m NaN m 0.6 meter
112 21FLGW_WQX-3565 .3 m NaN m 0.3 meter
122 21FLBFA_WQX-33010016 1.5 m NaN m 1.5 meter
125 21FLBFA_WQX-33010030 1.25 m NaN m 1.25 meter
... ... ... ... ... ... ...
418310 21FLESC_WQX-1004D-24Q1B 2.8 m NaN m 2.8 meter
418327 21FLPNS_WQX-G5NW0163 0.3 m NaN m 0.3 meter
418334 21FLESC_WQX-797-24Q1A 1.3 m NaN m 1.3 meter
418338 21FLPNS_WQX-G4NW0022 0.58 m NaN m 0.58 meter
418340 21FLESC_WQX-GRANDEBAYOU50 2.2 m NaN m 2.2 meter

14262 rows × 6 columns

[27]:
# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]
[27]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
164530 21FLKWAT_WQX-OKA-CBA-GAP-3-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
164544 21FLCBA_WQX-OKA-CB-BASS-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
164872 21FLCBA_WQX-OKA-CBA-GAP-3-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
165070 21FLCBA_WQX-OKA-CB-BASS-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
166146 21FLKWAT_WQX-OKA-CB-BASS-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
... ... ... ... ... ... ...
406356 21FLKWAT_WQX-SAN-SKI WATCH-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
406469 21FLCBA_WQX-OKA-CBA-GAP-1-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
407067 21FLCBA_WQX-OKA-CB-BASS-2 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
407084 21FLCBA_WQX-OKA-CBA-GAP-3-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
407462 21FLKWAT_WQX-SAN-SKI WATCH-5 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN

79 rows × 6 columns

[28]:
# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]
[28]:
'ResultMeasureValue: "Not Reported" result cannot be used; ResultMeasure/MeasureUnitCode: MISSING UNITS, m assumed'
[29]:
# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]
[29]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Units Secchi
164530 21FLKWAT_WQX-OKA-CBA-GAP-3-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
164544 21FLCBA_WQX-OKA-CB-BASS-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
164872 21FLCBA_WQX-OKA-CBA-GAP-3-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
165070 21FLCBA_WQX-OKA-CB-BASS-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
166146 21FLKWAT_WQX-OKA-CB-BASS-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
... ... ... ... ... ... ...
406356 21FLKWAT_WQX-SAN-SKI WATCH-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
406469 21FLCBA_WQX-OKA-CBA-GAP-1-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
407067 21FLCBA_WQX-OKA-CB-BASS-2 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
407084 21FLCBA_WQX-OKA-CBA-GAP-3-1 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN
407462 21FLKWAT_WQX-SAN-SKI WATCH-5 Not Reported NaN ResultMeasureValue: "Not Reported" result cann... m NaN

79 rows × 6 columns

If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added

[30]:
# Aggregate Secchi data by station
visualize.station_summary(sechi_results, 'Secchi')
[30]:
MonitoringLocationIdentifier cnt mean
0 11NPSWRD_WQX-GUIS_CMP_PKT01 12 2.333333
1 11NPSWRD_WQX-GUIS_CMP_PKT02 17 2.411765
2 11NPSWRD_WQX-GUIS_CMP_PKT03 3 2.333333
3 21AWIC-1063 120 0.775167
4 21AWIC-1122 59 2.892102
... ... ... ...
1040 NARS_WQX-NCCA10-1432 1 1.075000
1041 NARS_WQX-NCCA10-1433 1 1.423333
1042 NARS_WQX-NCCA10-1434 1 2.400000
1043 NARS_WQX-NCCA10-1488 1 0.736667
1044 NARS_WQX-NCCA10-2432 1 1.600000

1045 rows × 3 columns

[31]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
  self.bins = quantile(y, k=k)
[31]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_43_2.png
[32]:
# Map average secchi depth results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[32]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_44_1.png
Temperature

The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])

[33]:
#'Temperature, water'
# errors=‘ignore’, invalid dimension conversions will return the NaN.
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True, errors='ignore')
-Usable results-
count    83725.000000
mean        21.956358
std         10.266391
min        -12.944444
25%         17.000000
50%         22.200000
75%         27.140000
max       1876.000000
dtype: float64
Unusable results: 2
Usable results with inferred units: 10
Results outside threshold (0.0 to 83.55470407525675): 6
../_images/notebooks_Harmonize_Pensacola_Detailed_47_1.png
[34]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results
[34]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
4 21FLCBA_WQX-BAS02 78.9 deg F NaN 26.0555555555556 degree_Celsius degF
6 21FLPNS_WQX-33020J10 12.35 deg C NaN 12.35 degree_Celsius degC
21 21FLSEAS_WQX-02SEAS810 23 deg C NaN 23.0 degree_Celsius degC
22 21FLPNS_WQX-33010G10 11.23 deg C NaN 11.23 degree_Celsius degC
37 AWW_WQX-aww_2174 27 deg C NaN 27.0 degree_Celsius degC
... ... ... ... ... ... ...
418444 USGS-302420087140301 23.0 deg C NaN 23.0 degree_Celsius degC
418446 USGS-302703087133501 24.0 deg C NaN 24.0 degree_Celsius degC
418449 USGS-303733086441001 22.2 deg C NaN 22.2 degree_Celsius degC
418452 USGS-302703087133502 24.0 deg C NaN 24.0 degree_Celsius degC
418455 USGS-02376108 24.4 deg C NaN 24.4 degree_Celsius degC

83727 rows × 6 columns

In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius

[35]:
# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]
[35]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
148884 NARS_WQX-OWW04440-0401 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
223742 21FLCBA-FWB05 79.8 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 79.8 degree_Celsius degC
223808 21FLCBA-FWB05 81.7 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 81.7 degree_Celsius degC
224689 21FLCBA-FWB02 82.1 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 82.1 degree_Celsius degC
224690 21FLCBA-FWB02 82.6 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 82.6 degree_Celsius degC
224691 21FLCBA-FWB02 71.8 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 71.8 degree_Celsius degC
224692 21FLCBA-FWB02 79.4 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 79.4 degree_Celsius degC
230373 21FLCBA-RIV02 74.2 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 74.2 degree_Celsius degC
230374 21FLCBA-RIV02 74.2 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 74.2 degree_Celsius degC
235004 21FLCBA-FWB01 83.3 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 83.3 degree_Celsius degC
235088 21FLCBA-FWB01 71.2 NaN ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 71.2 degree_Celsius degC

We can see where the units were missing, the results were assumed to be in degree_Celsius already

[36]:
# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]
[36]:
'ResultMeasureValue: missing (NaN) result; ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'
[37]:
# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]
[37]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Temperature Units
47436 11NPSWRD_WQX-GUIS_NALO NaN deg C ResultMeasureValue: missing (NaN) result NaN degC
148884 NARS_WQX-OWW04440-0401 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN degC
[38]:
# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')
[38]:
MonitoringLocationIdentifier cnt mean
0 11NPSWRD_WQX-GUIS_ADEM_ALPT 30 24.986667
1 11NPSWRD_WQX-GUIS_BCCA 1 36.800000
2 11NPSWRD_WQX-GUIS_BISA 32 22.696250
3 11NPSWRD_WQX-GUIS_BOPI 1 32.000000
4 11NPSWRD_WQX-GUIS_CMP_PKT01 20 25.125000
... ... ... ...
2469 UWFCEDB_WQX-SRC-AI31-22 10 23.020000
2470 UWFCEDB_WQX-SRC-AI36-22 10 23.230000
2471 UWFCEDB_WQX-SRC-AI42-22 10 22.650000
2472 UWFCEDB_WQX-SRC-AI44-22 10 22.890000
2473 UWFCEDB_WQX-SRC-AK41-22 8 21.462500

2474 rows × 3 columns

[39]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[39]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_55_1.png
[40]:
# Map average temperature results at each station
gdf_temperature = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
[40]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_56_1.png

Dissolved oxygen

[41]:
# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')

Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.

[42]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res
[42]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag DO
1 21FLSEAS_WQX-02SEAS740 6.3 mg/L NaN 6.3 milligram / liter
5 21FLCMP_WQX-3201BM21 4.5 mg/L NaN 4.5 milligram / liter
12 21FLPNS_WQX-33030D71 6.64 mg/L NaN 6.64 milligram / liter
16 21FLBFA_WQX-33020057 1.17 mg/L NaN 1.17 milligram / liter
25 21FLNUTT_WQX-PB02 8.11 mg/L NaN 8.11 milligram / liter
... ... ... ... ... ...
418319 21AWIC-1122 8.261 mg/L NaN 8.261 milligram / liter
418324 21FLESC_WQX-MARCUSCRKW20 8.71 mg/L NaN 8.71 milligram / liter
418332 21FLESC_WQX-CRESCENTLAKE10 9.11 mg/L NaN 9.11 milligram / liter
418342 21FLPNS_WQX-G5NW0070 3.73 mg/L NaN 3.73 milligram / liter
418349 21FLPNS_WQX-G4NW0022 9.28 mg/L NaN 9.28 milligram / liter

66208 rows × 5 columns

[43]:
do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']
[43]:
MonitoringLocationIdentifier ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag DO
1 21FLSEAS_WQX-02SEAS740 6.3 mg/L NaN 6.3 milligram / liter
5 21FLCMP_WQX-3201BM21 4.5 mg/L NaN 4.5 milligram / liter
12 21FLPNS_WQX-33030D71 6.64 mg/L NaN 6.64 milligram / liter
16 21FLBFA_WQX-33020057 1.17 mg/L NaN 1.17 milligram / liter
25 21FLNUTT_WQX-PB02 8.11 mg/L NaN 8.11 milligram / liter
... ... ... ... ... ...
418319 21AWIC-1122 8.261 mg/L NaN 8.261 milligram / liter
418324 21FLESC_WQX-MARCUSCRKW20 8.71 mg/L NaN 8.71 milligram / liter
418332 21FLESC_WQX-CRESCENTLAKE10 9.11 mg/L NaN 9.11 milligram / liter
418342 21FLPNS_WQX-G5NW0070 3.73 mg/L NaN 3.73 milligram / liter
418349 21FLPNS_WQX-G4NW0022 9.28 mg/L NaN 9.28 milligram / liter

43130 rows × 5 columns

Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())

[44]:
# Aggregate DO data by station
visualize.station_summary(do_res, 'DO')
[44]:
MonitoringLocationIdentifier cnt mean
0 11NPSWRD_WQX-GUIS_ADEM_ALPT 30 6.698000
1 11NPSWRD_WQX-GUIS_BCCA 1 0.270000
2 11NPSWRD_WQX-GUIS_BISA 32 7.194375
3 11NPSWRD_WQX-GUIS_BOPI 1 7.540000
4 11NPSWRD_WQX-GUIS_FPPO 1 9.950000
... ... ... ...
1999 UWFCEDB_WQX-SRC-AI31-22 20 3.723390
2000 UWFCEDB_WQX-SRC-AI36-22 20 3.513705
2001 UWFCEDB_WQX-SRC-AI42-22 20 3.677337
2002 UWFCEDB_WQX-SRC-AI44-22 20 3.658370
2003 UWFCEDB_WQX-SRC-AK41-22 16 2.706512

2004 rows × 3 columns

[45]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[45]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_64_1.png
[46]:
# Map Averages at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[46]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_65_1.png

pH

[47]:
# pH, this time looking at a report
df = harmonize.harmonize(df, 'pH', report=True)
-Usable results-
count    57414.000000
mean         7.341551
std          0.904250
min          0.500000
25%          6.860000
50%          7.680000
75%          8.000000
max         16.200000
dtype: float64
Unusable results: 51
Usable results with inferred units: 36
Results outside threshold (0.0 to 12.767052173396209): 1
../_images/notebooks_Harmonize_Pensacola_Detailed_67_1.png

Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.

[48]:
df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]
[48]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag pH
2 7.29 None NaN 7.29 dimensionless
19 8.09 None NaN 8.09 dimensionless
24 7.45 None NaN 7.45 dimensionless
27 6.57 None NaN 6.57 dimensionless
29 6.57 None NaN 6.57 dimensionless
... ... ... ... ...
418453 5.9 std units NaN 5.9 dimensionless
418454 7.9 std units NaN 7.9 dimensionless
418456 8.3 std units NaN 8.3 dimensionless
418457 6.6 std units NaN 6.6 dimensionless
418458 8.1 std units NaN 8.1 dimensionless

57465 rows × 4 columns

‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless

Salinity

[49]:
# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True)
-Usable results-
count    71658.000000
mean        15.999914
std        152.600750
min          0.000000
25%          6.000000
50%         16.200000
75%         23.400000
max      37782.000000
dtype: float64
Unusable results: 417
Usable results with inferred units: 10
Results outside threshold (0.0 to 931.6044110771666): 4
../_images/notebooks_Harmonize_Pensacola_Detailed_72_1.png
[50]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]
[50]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity
9 18.9 ppth NaN 18.9 Practical_Salinity_Units
11 11.82 ppth NaN 11.82 Practical_Salinity_Units
17 .03 ppt NaN 0.03 Practical_Salinity_Units
23 33.27 ppth NaN 33.27 Practical_Salinity_Units
28 25.16 ppth NaN 25.16 Practical_Salinity_Units
... ... ... ... ...
418348 0 PSS NaN 0.0 Practical_Salinity_Units
418350 .02 ppt NaN 0.02 Practical_Salinity_Units
418357 0.01 ppth NaN 0.01 Practical_Salinity_Units
418358 23.81 ppth NaN 23.81 Practical_Salinity_Units
418360 17.1252 ppt NaN 17.1252 Practical_Salinity_Units

72075 rows × 4 columns

Nitrogen

[51]:
# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'cm3/g' UNDEFINED UNIT for Nitrogen
  warn("WARNING: " + problem)
-Usable results-
count     109.000000
mean       26.920174
std       160.257726
min         0.000700
25%         0.410000
50%         0.629000
75%         1.120000
max      1630.000000
dtype: float64
Unusable results: 4
Usable results with inferred units: 0
Results outside threshold (0.0 to 988.466532186079): 1
../_images/notebooks_Harmonize_Pensacola_Detailed_75_2.png
[52]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]
[52]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Nitrogen
43096 0.3 mg/L NaN 0.3 milligram / liter
43352 0.36 mg/L NaN 0.36 milligram / liter
43447 0.33875 mg/L NaN 0.33875 milligram / liter
43539 0.53125 mg/L NaN 0.53125 milligram / liter
44119 135 mg/kg NaN 135.00000000000003 milligram / liter
... ... ... ... ...
415651 18.69 mg/l NaN 18.69 milligram / liter
415658 16.18 mg/l NaN 16.18 milligram / liter
415659 18.99 mg/l NaN 18.99 milligram / liter
415662 18.72 mg/l NaN 18.72 milligram / liter
415664 17.61 mg/l NaN 17.61 milligram / liter

113 rows × 4 columns

Conductivity

[53]:
# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True)
-Usable results-
count     1818.000000
mean     17085.221414
std      16116.889030
min          0.040000
25%        130.000000
50%      16994.750000
75%      30306.650000
max      54886.200000
dtype: float64
Unusable results: 8
Usable results with inferred units: 0
Results outside threshold (0.0 to 113786.55559242623): 0
../_images/notebooks_Harmonize_Pensacola_Detailed_78_1.png
[54]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]
[54]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Conductivity
13 19204.2 umho/cm NaN 19204.2 microsiemens / centimeter
92 222.3 umho/cm NaN 222.3 microsiemens / centimeter
189 102.8 umho/cm NaN 102.8 microsiemens / centimeter
379 11017.5 umho/cm NaN 11017.5 microsiemens / centimeter
793 32 umho/cm NaN 32.0 microsiemens / centimeter
... ... ... ... ...
415851 110 umho/cm NaN 110.0 microsiemens / centimeter
415856 65 umho/cm NaN 65.0 microsiemens / centimeter
415858 110 umho/cm NaN 110.0 microsiemens / centimeter
415861 390 umho/cm NaN 390.0 microsiemens / centimeter
415864 65 umho/cm NaN 65.0 microsiemens / centimeter

1826 rows × 4 columns

Chlorophyll a

[55]:
# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'None' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/cm2' UNDEFINED UNIT for Chlorophyll
  warn("WARNING: " + problem)
-Usable results-
count    9266.000000
mean        1.169613
std         1.200140
min        -0.840000
25%         0.008300
50%         0.970000
75%         1.850000
max         9.990000
dtype: float64
Unusable results: 589
Usable results with inferred units: 0
Results outside threshold (0.0 to 8.370452978777962): 8
../_images/notebooks_Harmonize_Pensacola_Detailed_81_2.png
[56]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]
[56]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Chlorophyll
244 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN
547 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN
660 2.3 mg/m3 NaN 0.0023000000000000004 milligram / liter
1207 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN
1350 2.9 mg/m3 NaN 0.0029000000000000007 milligram / liter
... ... ... ... ...
417283 6.3 mg/m3 NaN 0.006300000000000001 milligram / liter
417307 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN
417407 1.1 mg/m3 NaN 0.0011000000000000003 milligram / liter
417987 4.4 mg/m3 NaN 0.004400000000000001 milligram / liter
418017 NaN NaN ResultMeasureValue: missing (NaN) result; Resu... NaN

9855 rows × 4 columns

Organic Carbon

[57]:
# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True)
-Usable results-
count      4820.000000
mean       1139.767819
std       11597.098499
min           0.000000
25%           2.700000
50%           4.400000
75%           8.300000
max      410000.000000
dtype: float64
Unusable results: 165
Usable results with inferred units: 0
Results outside threshold (0.0 to 70722.35881310055): 22
../_images/notebooks_Harmonize_Pensacola_Detailed_84_1.png
[58]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]
[58]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Carbon
121 2.6 mg/L NaN 2.6 milligram / liter
156 3.9 mg/L NaN 3.9 milligram / liter
206 5.2 mg/L NaN 5.2 milligram / liter
309 4.8 mg/L NaN 4.8 milligram / liter
338 1.0 mg/L NaN 1.0 milligram / liter
... ... ... ... ...
418239 10 mg/L NaN 10.0 milligram / liter
418288 14 mg/L NaN 14.0 milligram / liter
418300 3.2 mg/L NaN 3.2 milligram / liter
418313 3.1 mg/L NaN 3.1 milligram / liter
418330 4.297 mg/L NaN 4.297 milligram / liter

4985 rows × 4 columns

Turbidity

[59]:
# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True)
-Usable results-
count    40718.000000
mean         8.926798
std        203.564637
min         -0.840000
25%          1.420000
50%          2.590000
75%          4.900000
max      32342.452300
dtype: float64
Unusable results: 596
Usable results with inferred units: 10
Results outside threshold (0.0 to 1230.3146187462137): 45
../_images/notebooks_Harmonize_Pensacola_Detailed_87_1.png
[60]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]
[60]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Turbidity
26 0 NTU NaN 0.0 Nephelometric_Turbidity_Units
45 1.4 NTU NaN 1.4 Nephelometric_Turbidity_Units
50 4.7 NTU NaN 4.7 Nephelometric_Turbidity_Units
101 5.6 NTU NaN 5.6 Nephelometric_Turbidity_Units
103 28 NTU NaN 28.0 Nephelometric_Turbidity_Units
... ... ... ... ...
418351 0.73 NTU NaN 0.73 Nephelometric_Turbidity_Units
418362 4.7 NTU NaN 4.7 Nephelometric_Turbidity_Units
418374 0.0 JTU NaN -0.0477 Nephelometric_Turbidity_Units
418393 50 JTU NaN 951.2022999999999 Nephelometric_Turbidity_Units
418401 1700 JTU NaN 32342.452299999997 Nephelometric_Turbidity_Units

41314 rows × 4 columns

Sediment

[61]:
# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False)
[62]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]
[62]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Sediment

Phosphorus

Note: must be merged w/ activities (package runs query by site if not already merged)

[63]:
# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus')
2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"

Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.

[64]:
# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]
[64]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
35 .061 mg/L NaN NaN
66 0.03 mg/L NaN NaN
151 .13 mg/L NaN NaN
176 0.003 mg/L NaN NaN
315 0.002 mg/L NaN NaN
... ... ... ... ...
418214 .107 mg/L NaN NaN
418271 0.02 mg/L NaN NaN
418274 0.004 mg/L NaN NaN
418293 0.005 mg/L NaN NaN
418344 0.02 mg/L NaN NaN

6799 rows × 4 columns

[65]:
# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]
[65]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
35 .061 mg/L NaN NaN
66 0.03 mg/L NaN NaN
151 .13 mg/L NaN NaN
176 0.003 mg/L NaN NaN
315 0.002 mg/L NaN NaN
... ... ... ... ...
418214 .107 mg/L NaN NaN
418271 0.02 mg/L NaN NaN
418274 0.004 mg/L NaN NaN
418293 0.005 mg/L NaN NaN
418344 0.02 mg/L NaN NaN

6069 rows × 4 columns

[66]:
# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]
[66]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
3684 0.019 mg/L NaN 0.019 milligram / liter
8180 0.002 mg/L NaN 0.002 milligram / liter
23210 0.003 mg/L NaN 0.003 milligram / liter
26305 0.019 mg/L NaN 0.019 milligram / liter
56249 0.002 mg/L NaN 0.002 milligram / liter
58177 0.017 mg/L NaN 0.017 milligram / liter
91512 0.021 mg/L NaN 0.021 milligram / liter
94313 0.003 mg/L NaN 0.003 milligram / liter
101207 0.020 mg/L NaN 0.02 milligram / liter
107151 0.002 mg/L NaN 0.002 milligram / liter
148518 0.00806 mg/L NaN 0.00806 milligram / liter
151929 0.000031 mg/L NaN 3.1e-05 milligram / liter
153142 0.002542 mg/L NaN 0.002542 milligram / liter
153726 0.00341 mg/L NaN 0.00341 milligram / liter
205233 0.00372 mg/L NaN 0.00372 milligram / liter
207271 0.00961 mg/L NaN 0.00961 milligram / liter
208186 0.00124 mg/L NaN 0.00124 milligram / liter
209016 0.01271 mg/L NaN 0.01271 milligram / liter
414049 0.030 mg/l as P NaN 0.03 milligram / liter
414058 0.033 mg/l as P NaN 0.033 milligram / liter
414061 0.024 mg/l as P NaN 0.024 milligram / liter
414067 0.028 mg/l as P NaN 0.028 milligram / liter
414076 0.021 mg/l as P NaN 0.021 milligram / liter
414081 0.023 mg/l as P NaN 0.023 milligram / liter
414093 0.037 mg/l as P NaN 0.037 milligram / liter
414158 0.023 mg/l as P NaN 0.023 milligram / liter
414166 0.02 mg/l as P NaN 0.02 milligram / liter
414182 0.04 mg/l as P NaN 0.04 milligram / liter
414197 0.03 mg/l as P NaN 0.03 milligram / liter
414206 0.025 mg/l as P NaN 0.025 milligram / liter
414233 0.05 mg/l as P NaN 0.05 milligram / liter
414242 0.15 mg/l as P NaN 0.15 milligram / liter
414262 0.03 mg/l as P NaN 0.03 milligram / liter
414379 0.02 mg/l as P NaN 0.02 milligram / liter
414401 0.07 mg/l as P NaN 0.07 milligram / liter
414409 0.08 mg/l as P NaN 0.08 milligram / liter
414424 0.02 mg/l as P NaN 0.02 milligram / liter
414444 0.02 mg/l as P NaN 0.02 milligram / liter
414456 0.04 mg/l as P NaN 0.04 milligram / liter
414476 0.02 mg/l as P NaN 0.02 milligram / liter
414489 0.05 mg/l as P NaN 0.05 milligram / liter
415374 0.03 mg/l as P NaN 0.03 milligram / liter
415381 0.05 mg/l as P NaN 0.05 milligram / liter
[67]:
# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]
[67]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag TDP_Phosphorus
10359 .5 mg/L NaN NaN
10435 .036 mg/L NaN NaN
11767 .089 mg/L NaN NaN
12727 .017 mg/L NaN NaN
13836 .035 mg/L NaN NaN
... ... ... ... ...
415820 .18 mg/L NaN NaN
415829 .25 mg/L NaN NaN
415843 .16 mg/L NaN NaN
415846 .18 mg/L NaN NaN
415855 .31 mg/L NaN NaN

687 rows × 4 columns

Bacteria

Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.

Fecal Coliform

[68]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count    10035.000000
mean        45.537618
std        448.839329
min          0.000000
25%          4.000000
50%          8.000000
75%         33.000000
max      33000.000000
dtype: float64
Unusable results: 40571
Usable results with inferred units: 0
Results outside threshold (0.0 to 2738.573594138782): 6
../_images/notebooks_Harmonize_Pensacola_Detailed_103_2.png
[69]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]
[69]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Fecal_Coliform
0 *Non-detect NaN ResultMeasureValue: "*Non-detect" result canno... NaN
3 80 cfu/100mL NaN NaN
7 *Non-detect NaN ResultMeasureValue: "*Non-detect" result canno... NaN
8 2 MPN/100mL NaN NaN
10 *Non-detect NaN ResultMeasureValue: "*Non-detect" result canno... NaN
... ... ... ... ...
415706 245 cfu/100mL NaN NaN
417067 260 cfu/100mL NaN NaN
417221 120 cfu/100mL NaN NaN
417281 11 cfu/100mL NaN NaN
417464 58 cfu/100mL NaN NaN

50606 rows × 4 columns

Escherichia coli

[70]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
  warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count      22.000000
mean      501.863636
std       610.053260
min         4.000000
25%         9.500000
50%        77.500000
75%      1000.000000
max      1700.000000
dtype: float64
Unusable results: 8930
Usable results with inferred units: 0
Results outside threshold (0.0 to 4162.183198738116): 0
../_images/notebooks_Harmonize_Pensacola_Detailed_106_2.png
[71]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]
[71]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag E_coli
20 0 cfu/100mL NaN NaN
63 1000 cfu/100mL NaN NaN
69 33.3333333333333 cfu/100mL NaN NaN
82 0 cfu/100mL NaN NaN
136 0 cfu/100mL NaN NaN
... ... ... ... ...
418259 231 MPN/100mL NaN NaN
418297 85 MPN/100mL NaN NaN
418343 171 MPN/100mL NaN NaN
418353 30 MPN/100mL NaN NaN
418361 31 MPN/100mL NaN NaN

8952 rows × 4 columns

Combining Salinity and Conductivity

Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).

Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.

Explore Salinity results:

[72]:
from harmonize_wq import convert
[73]:
# Salinity summary statistics
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))
Range: 0.0 to 37782.0
Results: 71658
Mean: 15.999914010997138 PSU
[74]:
# Identify extreme outliers
[x for x in lst if x >3200]
[74]:
[15030.0, 37782.0]

Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration

[75]:
# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
[76]:
# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]
[76]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity
21771 15030 ppt NaN 15030.0 Practical_Salinity_Units
31254 322 ppth NaN 322.0 Practical_Salinity_Units
42625 2150 ppth NaN 2150.0 Practical_Salinity_Units
69596 37782 ppth NaN 37782.0 Practical_Salinity_Units
143785 2190 ppt NaN 2190.0 Practical_Salinity_Units

Detection limits may help understand what caused low values and what results might need to be dropped or updated

[77]:
from harmonize_wq import wrangle
[78]:
df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
       'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode']
[79]:
# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]
[79]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity ResultDetectionConditionText DetectionQuantitationLimitTypeName DetectionQuantitationLimitMeasure/MeasureValue DetectionQuantitationLimitMeasure/MeasureUnitCode
2168 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
3808 0.00 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
18985 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
27307 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
32679 0.0 ppth NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
418176 0 PSS NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
418180 0 PSS NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
418191 0 PSS NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
418318 0 PSS NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN
418348 0 PSS NaN 0.0 Practical_Salinity_Units NaN NaN NaN NaN

2909 rows × 8 columns

Explore Conductivity results:

[80]:
# Create series and inspect Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series
[80]:
13        19204.2 microsiemens / centimeter
92          222.3 microsiemens / centimeter
189         102.8 microsiemens / centimeter
379       11017.5 microsiemens / centimeter
793          32.0 microsiemens / centimeter
                        ...
415851      110.0 microsiemens / centimeter
415856       65.0 microsiemens / centimeter
415858      110.0 microsiemens / centimeter
415861      390.0 microsiemens / centimeter
415864       65.0 microsiemens / centimeter
Name: Conductivity, Length: 1818, dtype: object

Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters

[81]:
# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]
[81]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity Conductivity
115517 54886.2 umho/cm NaN NaN 54886.2 microsiemens / centimeter
118761 54871.3 umho/cm NaN NaN 54871.3 microsiemens / centimeter
112192 54860.6 umho/cm NaN NaN 54860.6 microsiemens / centimeter
117526 54859.3 umho/cm NaN NaN 54859.3 microsiemens / centimeter
111578 54850.8 umho/cm NaN NaN 54850.8 microsiemens / centimeter
... ... ... ... ... ...
76805 6.8 umho/cm NaN NaN 6.8 microsiemens / centimeter
51581 2 umho/cm NaN NaN 2.0 microsiemens / centimeter
143904 2 umho/cm NaN NaN 2.0 microsiemens / centimeter
36860 1 umho/cm NaN NaN 1.0 microsiemens / centimeter
145111 .04 umho/cm NaN NaN 0.04 microsiemens / centimeter

1818 rows × 5 columns

[82]:
# Check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.loc[df['Conductivity'].notna(), cols]
[82]:
ResultMeasureValue ResultMeasure/MeasureUnitCode QA_flag Salinity Conductivity
115517 54886.2 umho/cm NaN NaN 54886.2 microsiemens / centimeter
118761 54871.3 umho/cm NaN NaN 54871.3 microsiemens / centimeter
112192 54860.6 umho/cm NaN NaN 54860.6 microsiemens / centimeter
117526 54859.3 umho/cm NaN NaN 54859.3 microsiemens / centimeter
111578 54850.8 umho/cm NaN NaN 54850.8 microsiemens / centimeter
... ... ... ... ... ...
76805 6.8 umho/cm NaN NaN 6.8 microsiemens / centimeter
51581 2 umho/cm NaN NaN 2.0 microsiemens / centimeter
143904 2 umho/cm NaN NaN 2.0 microsiemens / centimeter
36860 1 umho/cm NaN NaN 1.0 microsiemens / centimeter
145111 .04 umho/cm NaN NaN 0.04 microsiemens / centimeter

1818 rows × 5 columns

[83]:
# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str)  # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']
[83]:
115517    36.356 dimensionless
118761    36.345 dimensionless
112192    36.338 dimensionless
117526    36.336 dimensionless
111578     36.33 dimensionless
                  ...
76805      0.013 dimensionless
51581      0.012 dimensionless
143904     0.012 dimensionless
36860      0.012 dimensionless
145111     0.012 dimensionless
Name: Salinity, Length: 1818, dtype: object

Datetime

datetime() formats time using dataretrieval and ActivityStart

[84]:
# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]
[84]:
ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode
115517 2007-08-09 12:15:00 CST
118761 2007-08-09 12:15:00 CST
112192 2007-08-09 12:15:00 CST
117526 2007-08-09 12:15:00 CST
111578 2007-08-09 12:15:00 CST
... ... ... ...
418454 1959-02-04 NaN NaN
418455 1958-01-14 09:20:00 CST
418456 1958-01-14 09:20:00 CST
418457 1959-02-05 NaN NaN
418458 1959-02-02 NaN NaN

418459 rows × 3 columns

[85]:
# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 35999 incomplete dates found, consider setting datetime_index to False.
  warnings.warn(
[85]:
ActivityStartDate Activity_datetime
115517 2007-08-09 2007-08-09 18:15:00+00:00
118761 2007-08-09 2007-08-09 18:15:00+00:00
112192 2007-08-09 2007-08-09 18:15:00+00:00
117526 2007-08-09 2007-08-09 18:15:00+00:00
111578 2007-08-09 2007-08-09 18:15:00+00:00
... ... ...
418454 1959-02-04 NaT
418455 1958-01-14 1958-01-14 15:20:00+00:00
418456 1958-01-14 1958-01-14 15:20:00+00:00
418457 1959-02-05 NaT
418458 1959-02-02 NaT

418459 rows × 2 columns

Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.

Depth

Note: Data are often lacking sample depth metadata

[86]:
# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()
[86]:
1588       7.0
3668       7.0
78393      1.0
78895     16.0
79291     16.0
          ...
100784     1.3
100788     0.5
100823     2.0
100880     2.2
143344    35.0
Name: ResultDepthHeightMeasure/MeasureValue, Length: 179, dtype: float64

Characteristic to Column (long to wide format)

[87]:
# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2
[87]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode MonitoringLocationIdentifier ResultIdentifier DataLoggerLine ResultDetectionConditionText ... QA_Carbon QA_Salinity QA_TP_Phosphorus QA_TDP_Phosphorus QA_Other_Phosphorus QA_DO QA_Chlorophyll QA_Nitrogen QA_Temperature QA_E_coli
115517 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230231_173 2007-08-09 12:15:00 -0600 21AWIC-1122 STORET-170383613 230231.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
118761 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230230_173 2007-08-09 12:15:00 -0600 21AWIC-1122 STORET-170383607 230230.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
112192 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230228_173 2007-08-09 12:15:00 -0600 21AWIC-1122 STORET-170383595 230228.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
117526 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230229_173 2007-08-09 12:15:00 -0600 21AWIC-1122 STORET-170383601 230229.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
111578 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230227_173 2007-08-09 12:15:00 -0600 21AWIC-1122 STORET-170383589 230227.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
418454 USGS-FL USGS Florida Water Science Center nwisfl.01.95900924 1959-02-04 NaN NaN USGS-303745086442101 NWIS-98928104 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418455 USGS-FL USGS Florida Water Science Center nwisfl.01.95800572 1958-01-14 09:20:00 -0600 USGS-02376108 NWIS-6891392 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418456 USGS-FL USGS Florida Water Science Center nwisfl.01.95800572 1958-01-14 09:20:00 -0600 USGS-02376108 NWIS-6891396 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418457 USGS-FL USGS Florida Water Science Center nwisfl.01.95900926 1959-02-05 NaN NaN USGS-303820086241802 NWIS-6916678 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418458 USGS-FL USGS Florida Water Science Center nwisfl.01.95900852 1959-02-02 NaN NaN USGS-302330086482001 NWIS-6916402 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

366769 rows × 117 columns

[88]:
# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))
14 new columns
[89]:
# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))
51690 fewer rows
[90]:
#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]
[90]:
ResultMeasureValue ResultMeasure/MeasureUnitCode Carbon QA_Carbon

Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)

[91]:
# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)
[92]:
# Columns still in main table
main_df.columns
[92]:
Index(['OrganizationIdentifier', 'OrganizationFormalName',
       'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
       'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
       'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
       'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
       'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
       'E_coli', 'DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode',
       'Activity_datetime', 'Depth', 'QA_Fecal_Coliform', 'QA_Turbidity',
       'QA_pH', 'QA_Secchi', 'QA_Conductivity', 'QA_Carbon', 'QA_Salinity',
       'QA_TP_Phosphorus', 'QA_TDP_Phosphorus', 'QA_Other_Phosphorus', 'QA_DO',
       'QA_Chlorophyll', 'QA_Nitrogen', 'QA_Temperature', 'QA_E_coli'],
      dtype='object')
[93]:
# look at main table results (first 5)
main_df.head()
[93]:
OrganizationIdentifier OrganizationFormalName ActivityIdentifier MonitoringLocationIdentifier ProviderName Secchi Temperature DO pH Salinity ... QA_Carbon QA_Salinity QA_TP_Phosphorus QA_TDP_Phosphorus QA_Other_Phosphorus QA_DO QA_Chlorophyll QA_Nitrogen QA_Temperature QA_E_coli
115517 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230231_173 21AWIC-1122 STORET NaN NaN NaN NaN 36.356 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
118761 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230230_173 21AWIC-1122 STORET NaN NaN NaN NaN 36.345 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
112192 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230228_173 21AWIC-1122 STORET NaN NaN NaN NaN 36.338 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
117526 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230229_173 21AWIC-1122 STORET NaN NaN NaN NaN 36.336 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
111578 21AWIC ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... 21AWIC-51908_230227_173 21AWIC-1122 STORET NaN NaN NaN NaN 36.33 dimensionless ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 44 columns

[94]:
# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]
[94]:
['Sediment',
 'QA_Fecal_Coliform',
 'QA_Secchi',
 'QA_Conductivity',
 'QA_Carbon',
 'QA_TP_Phosphorus',
 'QA_TDP_Phosphorus',
 'QA_Other_Phosphorus',
 'QA_E_coli']
[95]:
# Map average results at each station
gdf_avg = visualize.map_measure(main_df, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[95]:
<Axes: >
../_images/notebooks_Harmonize_Pensacola_Detailed_146_1.png