Pensacola Bay FL - Detailed step-by-step
Standardize, clean and wrangle Water Quality Portal data in Pensacola and Perdido Bays into more analytic-ready formats using the harmonize_wq package
US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help:
Identify differences in data units (including speciation and basis)
Identify differences in sampling or analytic methods
Resolve data errors using transparent assumptions
Reduce data to the columns that are most commonly needed
Transform data from long to wide format
Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
Detailed step-by-step workflow
This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Pensacola and Perdido Bays
Install and import the required libraries
[1]:
import sys
#!python -m pip uninstall harmonize-wq --yes
#!python -m pip install harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8
[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
warnings.warn('GDAL not installed. Some functions will not work.')
Download location data using dataretrieval
[3]:
# Read geometry for Area of Interest from geojson file url and plot
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
aoi_gdf = wrangle.as_gdf(aoi_url).to_crs(epsg=4326) # already standard 4326
aoi_gdf.plot()
[3]:
<Axes: >
[4]:
# Note there are actually two polygons (one for each Bay)
aoi_gdf
# Spatial query parameters can be updated to run just one
bBox = wrangle.get_bounding_box(aoi_gdf)
# For only one bay, e.g., first is Pensacola Bay:
#bBox = wrangle.get_bounding_box(aoi_gdf, 0)
[5]:
# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
'Temperature, water',
'Depth, Secchi disk depth',
'Dissolved oxygen (DO)',
'Salinity',
'pH',
'Nitrogen',
'Conductivity',
'Organic carbon',
'Chlorophyll a',
'Turbidity',
'Sediment',
'Fecal Coliform',
'Escherichia coli']}
query['bBox'] = bBox
[6]:
# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)
[7]:
# Rows and columns for results
stations.shape
[7]:
(2725, 37)
[8]:
# First 5 rows
stations.head()
[8]:
OrganizationIdentifier | OrganizationFormalName | MonitoringLocationIdentifier | MonitoringLocationName | MonitoringLocationTypeName | MonitoringLocationDescriptionText | HUCEightDigitCode | DrainageAreaMeasure/MeasureValue | DrainageAreaMeasure/MeasureUnitCode | ContributingDrainageAreaMeasure/MeasureValue | ... | AquiferName | LocalAqfrName | FormationTypeText | AquiferTypeName | ConstructionDateText | WellDepthMeasure/MeasureValue | WellDepthMeasure/MeasureUnitCode | WellHoleDepthMeasure/MeasureValue | WellHoleDepthMeasure/MeasureUnitCode | ProviderName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | USGS-AL | USGS Alabama Water Science Center | USGS-02376115 | ELEVENMILE CREEK NR WEST PENSACOLA, FL | Stream | NaN | 3140107.0 | 27.8 | sq mi | 27.8 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1 | USGS-AL | USGS Alabama Water Science Center | USGS-02377570 | STYX RIVER NEAR ELSANOR, AL. | Stream | NaN | 3140106.0 | 192.0 | sq mi | 192.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
2 | USGS-AL | USGS Alabama Water Science Center | USGS-02377920 | BLACKWATER RIVER AT US HWY 90 NR ROBERTSDALE, AL. | Stream | NaN | 3140106.0 | 23.1 | sq mi | 23.1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
3 | USGS-AL | USGS Alabama Water Science Center | USGS-02377960 | BLACKWATER RIVER AT CO RD 87 NEAR ELSANOR, AL. | Stream | NaN | 3140106.0 | 56.6 | sq mi | 56.6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
4 | USGS-AL | USGS Alabama Water Science Center | USGS-02377975 | BLACKWATER RIVER ABOVE SEMINOLE AL | Stream | NaN | 3140106.0 | 40.2 | sq mi | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
5 rows × 37 columns
[9]:
# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]
[9]:
HorizontalCoordinateReferenceSystemDatumName NAD83
LatitudeMeasure 30.498252
LongitudeMeasure -87.335809
Name: 0, dtype: object
[10]:
# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, out_EPSG=4326, intermediate_columns=True)
[11]:
location.harmonize_locations?
[12]:
# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape
[12]:
(2725, 42)
[13]:
# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]
[13]:
geom_orig (-87.3358086, 30.49825159)
EPSG 4269.0
QA_flag NaN
geom POINT (-87.3358086 30.49825159)
geometry POINT (-87.3358086 30.49825159)
Name: 0, dtype: object
[14]:
# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype
[14]:
<geopandas.array.GeometryDtype at 0x7f5f595a5e10>
[15]:
# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])
[15]:
{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
'LatitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
'LongitudeMeasure: Imprecise: lessthan3decimaldigits',
'LongitudeMeasure: Imprecise: lessthan3decimaldigits; HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed'}
[16]:
# Map it
stations_gdf.plot()
[16]:
<Axes: >
[17]:
# Clip to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)
[18]:
# Map it
stations_clipped.plot()
[18]:
<Axes: >
[19]:
# How many stations now?
len(stations_clipped)
[19]:
1305
[20]:
# To save the results to a shapefile
#import os
#path = '' #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path, 'PPBEP_stations.shp'))
Retrieve Characteristic Data
[21]:
# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (10,13,15,17,19,20,21,22,23,28,31,33,34,36,58,60,61,64,65,69,70,71,72,73) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(StringIO(response.text), delimiter=',')
[22]:
df = res_narrow
df
[22]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | AnalysisEndTime/TimeZoneCode | ResultLaboratoryCommentCode | ResultLaboratoryCommentText | ResultDetectionQuantitationLimitUrl | LaboratoryAccreditationIndicator | LaboratoryAccreditationAuthorityName | TaxonomistAccreditationIndicator | TaxonomistAccreditationAuthorityName | LabSamplePreparationUrl | ProviderName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21FLPNS_WQX | FL Dept. of Environmental Protection, Northwes... | 21FLPNS_WQX-1536988F1 | 2013-09-17 | 11:01:00 | EST | 21FLPNS_WQX-33030019 | STORET-308146602 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
1 | 21FLPNS_WQX | FL Dept. of Environmental Protection, Northwes... | 21FLPNS_WQX-1520810L | 2013-07-23 | 10:01:00 | EST | 21FLPNS_WQX-33020146 | STORET-308157908 | NaN | NaN | ... | NaN | NaN | NaN | https://www.waterqualitydata.us/data/providers... | NaN | NaN | NaN | NaN | NaN | STORET |
2 | 21FLCBA_WQX | CHOCTAWHATCHEE BASIN ALLIANCE | 21FLCBA_WQX-BAS219848-162813 | 2013-09-23 | 17:15:00 | CST | 21FLCBA_WQX-BAS02 | STORET-760593202 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
3 | 21FLSEAS_WQX | Florida Department of Environmental Protection | 21FLSEAS_WQX-028800618132 | 2013-06-18 | 11:01:00 | EST | 21FLSEAS_WQX-02SEAS880 | STORET-310466105 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
4 | 21FLBFA_WQX | FL Dept of Environmental Protection , Bream Fi... | 21FLBFA_WQX-1558337F1 | 2013-12-01 | 13:01:00 | EST | 21FLBFA_WQX-33020LT2 | STORET-291224060 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
398031 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335811_1873104_173 | 2024-03-13 | 08:20:00 | CDT | 21AWIC-9768 | STORET-1039487062 | 1873104.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
398032 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335642_1872423_173 | 2024-03-06 | 07:45:00 | CDT | 21AWIC-1208 | STORET-1039486410 | 1872423.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
398033 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335642_1872426_173 | 2024-03-06 | 07:45:00 | CDT | 21AWIC-1208 | STORET-1039486431 | 1872426.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
398034 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335811_1873106_173 | 2024-03-13 | 08:20:00 | CDT | 21AWIC-9768 | STORET-1039487073 | 1873106.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
398035 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335728_1874876_173 | 2024-03-06 | 11:30:00 | CDT | 21AWIC-1152 | STORET-1039490686 | 1874876.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
398036 rows × 78 columns
[23]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[23]:
<Axes: >
Harmonize Characteristic Results
Two options for functions to harmonize characteristics: harmonize_all() or harmonize_generic(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize_generic runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.
[24]:
# See Documentation
#harmonize.harmonize_all?
#harmonize.harmonize?
secchi disk depth
[25]:
# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)
# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, 'Depth, Secchi disk depth', intermediate_columns=True, report=True)
-Usable results-
count 13323.000000
mean 1.195995
std 2.406123
min 0.000000
25% 0.600000
50% 1.000000
75% 1.500000
max 260.000000
dtype: float64
Unusable results: 75
Usable results with inferred units: 0
Results outside threshold (0.0 to 15.632730548117232): 1
The threshold is based on standard deviations and is currently only used in the histogram.
[26]:
# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results
[26]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Units | Secchi | |
---|---|---|---|---|---|---|
53 | 21FLPNS_WQX-33020JF1 | 0.60 | m | NaN | m | 0.6 meter |
60 | 21FLGW_WQX-3565 | .3 | m | NaN | m | 0.3 meter |
64 | 21FLBFA_WQX-33010016 | 1.5 | m | NaN | m | 1.5 meter |
107 | 21AWIC-7290 | .94 | m | NaN | m | 0.94 meter |
130 | 21FLBFA_WQX-33010030 | 1.25 | m | NaN | m | 1.25 meter |
... | ... | ... | ... | ... | ... | ... |
397845 | 21AWIC-9630 | 2.65 | m | NaN | m | 2.65 meter |
397870 | 21AWIC-1122 | 5.54 | m | NaN | m | 5.54 meter |
397940 | 21AWIC-1063 | 1.14 | m | NaN | m | 1.14 meter |
397984 | 21AWIC-7290 | .65 | m | NaN | m | 0.65 meter |
398019 | 21AWIC-1208 | 1.22 | m | NaN | m | 1.22 meter |
13398 rows × 6 columns
[27]:
# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]
[27]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Units | Secchi | |
---|---|---|---|---|---|---|
147997 | 21FLCBA_WQX-OKA-CB-BASS-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
148044 | 21FLKWAT_WQX-OKA-CBA-GAP-3-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
148395 | 21FLCBA_WQX-OKA-CBA-GAP-3-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
148488 | 21FLCBA_WQX-OKA-CB-BASS-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
149503 | 21FLKWAT_WQX-OKA-CB-BASS-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
... | ... | ... | ... | ... | ... | ... |
394973 | 21FLKWAT_WQX-SAN-SKI WATCH-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395025 | 21FLCBA_WQX-OKA-CBA-GAP-1-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395292 | 21FLCBA_WQX-OKA-CB-BASS-2 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395295 | 21FLCBA_WQX-OKA-CBA-GAP-3-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395477 | 21FLKWAT_WQX-SAN-SKI WATCH-5 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
75 rows × 6 columns
[28]:
# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]
[28]:
'ResultMeasureValue: "Not Reported" result cannot be used; ResultMeasure/MeasureUnitCode: MISSING UNITS, m assumed'
[29]:
# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]
[29]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Units | Secchi | |
---|---|---|---|---|---|---|
147997 | 21FLCBA_WQX-OKA-CB-BASS-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
148044 | 21FLKWAT_WQX-OKA-CBA-GAP-3-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
148395 | 21FLCBA_WQX-OKA-CBA-GAP-3-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
148488 | 21FLCBA_WQX-OKA-CB-BASS-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
149503 | 21FLKWAT_WQX-OKA-CB-BASS-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
... | ... | ... | ... | ... | ... | ... |
394973 | 21FLKWAT_WQX-SAN-SKI WATCH-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395025 | 21FLCBA_WQX-OKA-CBA-GAP-1-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395292 | 21FLCBA_WQX-OKA-CB-BASS-2 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395295 | 21FLCBA_WQX-OKA-CBA-GAP-3-1 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
395477 | 21FLKWAT_WQX-SAN-SKI WATCH-5 | Not Reported | NaN | ResultMeasureValue: "Not Reported" result cann... | m | NaN |
75 rows × 6 columns
If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added
[30]:
# Aggregate Secchi data by station
visualize.station_summary(sechi_results, 'Secchi')
[30]:
MonitoringLocationIdentifier | cnt | mean | |
---|---|---|---|
0 | 11NPSWRD_WQX-GUIS_CMP_PKT01 | 12 | 2.333333 |
1 | 11NPSWRD_WQX-GUIS_CMP_PKT02 | 17 | 2.411765 |
2 | 11NPSWRD_WQX-GUIS_CMP_PKT03 | 3 | 2.333333 |
3 | 21AWIC-1063 | 118 | 0.773729 |
4 | 21AWIC-1122 | 58 | 2.859034 |
... | ... | ... | ... |
910 | NARS_WQX-NCCA10-1432 | 1 | 1.075000 |
911 | NARS_WQX-NCCA10-1433 | 1 | 1.423333 |
912 | NARS_WQX-NCCA10-1434 | 1 | 2.400000 |
913 | NARS_WQX-NCCA10-1488 | 1 | 0.736667 |
914 | NARS_WQX-NCCA10-2432 | 1 | 1.600000 |
915 rows × 3 columns
[31]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[31]:
<Axes: >
[32]:
# Map average secchi depth results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[32]:
<Axes: >
Temperature
The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])
[33]:
#'Temperature, water'
# errors=‘ignore’, invalid dimension conversions will return the NaN.
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True, errors='ignore')
-Usable results-
count 80160.000000
mean 21.920816
std 10.423333
min -12.944444
25% 17.000000
50% 22.200000
75% 27.110000
max 1876.000000
dtype: float64
Unusable results: 2
Usable results with inferred units: 10
Results outside threshold (0.0 to 84.4608124163855): 6
[34]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results
[34]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
2 | 21FLCBA_WQX-BAS02 | 78.9 | deg F | NaN | 26.0555555555556 degree_Celsius | degF |
23 | 21FLPNS_WQX-33030019 | 23.12 | deg C | NaN | 23.12 degree_Celsius | degC |
36 | 21FLPNS_WQX-330300G9 | 16.43 | deg C | NaN | 16.43 degree_Celsius | degC |
38 | 21FLPNS_WQX-33010H24 | 29.78 | deg C | NaN | 29.78 degree_Celsius | degC |
40 | 21FLPNS_WQX-33010G10 | 26.72 | deg C | NaN | 26.72 degree_Celsius | degC |
... | ... | ... | ... | ... | ... | ... |
398023 | 21AWIC-1606 | 14.93 | deg C | NaN | 14.93 degree_Celsius | degC |
398028 | 21AWIC-9631 | 18.1337 | deg C | NaN | 18.1337 degree_Celsius | degC |
398030 | 21AWIC-7290 | 18.8128 | deg C | NaN | 18.8128 degree_Celsius | degC |
398032 | 21AWIC-1208 | 19.1005 | deg C | NaN | 19.1005 degree_Celsius | degC |
398033 | 21AWIC-1208 | 19.2875 | deg C | NaN | 19.2875 degree_Celsius | degC |
80162 rows × 6 columns
In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius
[35]:
# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]
[35]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
156307 | NARS_WQX-OWW04440-0401 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN | degC |
218976 | 21FLCBA-FWB05 | 79.8 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 79.8 degree_Celsius | degC |
219042 | 21FLCBA-FWB05 | 81.7 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 81.7 degree_Celsius | degC |
219923 | 21FLCBA-FWB02 | 82.1 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 82.1 degree_Celsius | degC |
219924 | 21FLCBA-FWB02 | 82.6 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 82.6 degree_Celsius | degC |
219925 | 21FLCBA-FWB02 | 71.8 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 71.8 degree_Celsius | degC |
219926 | 21FLCBA-FWB02 | 79.4 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 79.4 degree_Celsius | degC |
225607 | 21FLCBA-RIV02 | 74.2 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 74.2 degree_Celsius | degC |
225608 | 21FLCBA-RIV02 | 74.2 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 74.2 degree_Celsius | degC |
230238 | 21FLCBA-FWB01 | 83.3 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 83.3 degree_Celsius | degC |
230322 | 21FLCBA-FWB01 | 71.2 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 71.2 degree_Celsius | degC |
We can see where the units were missing, the results were assumed to be in degree_Celsius already
[36]:
# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]
[36]:
'ResultMeasureValue: missing (NaN) result; ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'
[37]:
# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]
[37]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
47448 | 11NPSWRD_WQX-GUIS_NALO | NaN | deg C | ResultMeasureValue: missing (NaN) result | NaN | degC |
156307 | NARS_WQX-OWW04440-0401 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN | degC |
[38]:
# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')
[38]:
MonitoringLocationIdentifier | cnt | mean | |
---|---|---|---|
0 | 11NPSWRD_WQX-GUIS_ADEM_ALPT | 30 | 24.986667 |
1 | 11NPSWRD_WQX-GUIS_BCCA | 1 | 36.800000 |
2 | 11NPSWRD_WQX-GUIS_BISA | 32 | 22.696250 |
3 | 11NPSWRD_WQX-GUIS_BOPI | 1 | 32.000000 |
4 | 11NPSWRD_WQX-GUIS_CMP_PKT01 | 20 | 25.125000 |
... | ... | ... | ... |
2325 | UWFCEDB_WQX-SRC-AI31-22 | 10 | 23.020000 |
2326 | UWFCEDB_WQX-SRC-AI36-22 | 10 | 23.230000 |
2327 | UWFCEDB_WQX-SRC-AI42-22 | 10 | 22.650000 |
2328 | UWFCEDB_WQX-SRC-AI44-22 | 10 | 22.890000 |
2329 | UWFCEDB_WQX-SRC-AK41-22 | 8 | 21.462500 |
2330 rows × 3 columns
[39]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[39]:
<Axes: >
[40]:
# Map average temperature results at each station
gdf_temperature = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
[40]:
<Axes: >
Dissolved oxygen
[41]:
# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')
Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.
[42]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res
[42]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | DO | |
---|---|---|---|---|---|
7 | 21FLPNS_WQX-33030D71 | 6.64 | mg/L | NaN | 6.64 milligram / liter |
13 | 21FLNUTT_WQX-PB02 | 8.11 | mg/L | NaN | 8.11 milligram / liter |
21 | 21FLCBA_WQX-SRS03 | 6.28 | mg/L | NaN | 6.28 milligram / liter |
22 | 21FLPNS_WQX-33020K20 | 4.53 | mg/L | NaN | 4.53 milligram / liter |
26 | 21FLPNS_WQX-33030019 | 9.32 | mg/L | NaN | 9.32 milligram / liter |
... | ... | ... | ... | ... | ... |
398020 | 21AWIC-7290 | 1.0486 | mg/L | NaN | 1.0486 milligram / liter |
398021 | 21AWIC-9630 | 7.7673 | mg/L | NaN | 7.7673 milligram / liter |
398022 | 21AWIC-9631 | 7.3366 | mg/L | NaN | 7.3366 milligram / liter |
398025 | 21AWIC-7290 | .4967 | mg/L | NaN | 0.4967 milligram / liter |
398027 | 21AWIC-1122 | 8.261 | mg/L | NaN | 8.261 milligram / liter |
62645 rows × 5 columns
[43]:
do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']
[43]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | DO | |
---|---|---|---|---|---|
7 | 21FLPNS_WQX-33030D71 | 6.64 | mg/L | NaN | 6.64 milligram / liter |
13 | 21FLNUTT_WQX-PB02 | 8.11 | mg/L | NaN | 8.11 milligram / liter |
21 | 21FLCBA_WQX-SRS03 | 6.28 | mg/L | NaN | 6.28 milligram / liter |
22 | 21FLPNS_WQX-33020K20 | 4.53 | mg/L | NaN | 4.53 milligram / liter |
26 | 21FLPNS_WQX-33030019 | 9.32 | mg/L | NaN | 9.32 milligram / liter |
... | ... | ... | ... | ... | ... |
398020 | 21AWIC-7290 | 1.0486 | mg/L | NaN | 1.0486 milligram / liter |
398021 | 21AWIC-9630 | 7.7673 | mg/L | NaN | 7.7673 milligram / liter |
398022 | 21AWIC-9631 | 7.3366 | mg/L | NaN | 7.3366 milligram / liter |
398025 | 21AWIC-7290 | .4967 | mg/L | NaN | 0.4967 milligram / liter |
398027 | 21AWIC-1122 | 8.261 | mg/L | NaN | 8.261 milligram / liter |
39567 rows × 5 columns
Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())
[44]:
# Aggregate DO data by station
visualize.station_summary(do_res, 'DO')
[44]:
MonitoringLocationIdentifier | cnt | mean | |
---|---|---|---|
0 | 11NPSWRD_WQX-GUIS_ADEM_ALPT | 30 | 6.698000 |
1 | 11NPSWRD_WQX-GUIS_BCCA | 1 | 0.270000 |
2 | 11NPSWRD_WQX-GUIS_BISA | 32 | 7.194375 |
3 | 11NPSWRD_WQX-GUIS_BOPI | 1 | 7.540000 |
4 | 11NPSWRD_WQX-GUIS_FPPO | 1 | 9.950000 |
... | ... | ... | ... |
1855 | UWFCEDB_WQX-SRC-AI31-22 | 20 | 3.723390 |
1856 | UWFCEDB_WQX-SRC-AI36-22 | 20 | 3.513705 |
1857 | UWFCEDB_WQX-SRC-AI42-22 | 20 | 3.677337 |
1858 | UWFCEDB_WQX-SRC-AI44-22 | 20 | 3.658370 |
1859 | UWFCEDB_WQX-SRC-AK41-22 | 16 | 2.706512 |
1860 rows × 3 columns
[45]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[45]:
<Axes: >
[46]:
# Map Averages at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[46]:
<Axes: >
pH
[47]:
# pH, this time looking at a report
df = harmonize.harmonize(df, 'pH', report=True)
-Usable results-
count 53851.000000
mean 7.350465
std 0.904778
min 0.500000
25% 6.890000
50% 7.700000
75% 8.000000
max 16.200000
dtype: float64
Unusable results: 51
Usable results with inferred units: 36
Results outside threshold (0.0 to 12.779133009876027): 1
Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.
[48]:
df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]
[48]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | pH | |
---|---|---|---|---|
0 | 7.29 | None | NaN | 7.29 dimensionless |
12 | 7.45 | None | NaN | 7.45 dimensionless |
15 | 6.57 | None | NaN | 6.57 dimensionless |
16 | 6.57 | None | NaN | 6.57 dimensionless |
18 | 7.72 | None | NaN | 7.72 dimensionless |
... | ... | ... | ... | ... |
398003 | 7.893 | None | NaN | 7.893 dimensionless |
398013 | 7.206 | None | NaN | 7.206 dimensionless |
398015 | 4.78 | None | NaN | 4.78 dimensionless |
398018 | 7.9835 | None | NaN | 7.9835 dimensionless |
398034 | 7.196 | None | NaN | 7.196 dimensionless |
53902 rows × 4 columns
‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless
Salinity
[49]:
# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True)
-Usable results-
count 68190.000000
mean 16.236781
std 156.409068
min 0.000000
25% 6.500000
50% 16.400000
75% 23.520000
max 37782.000000
dtype: float64
Unusable results: 416
Usable results with inferred units: 10
Results outside threshold (0.0 to 954.6911897215225): 4
[50]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]
[50]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | |
---|---|---|---|---|
4 | 18.9 | ppth | NaN | 18.9 Practical_Salinity_Units |
6 | 11.82 | ppth | NaN | 11.82 Practical_Salinity_Units |
10 | .03 | ppt | NaN | 0.03 Practical_Salinity_Units |
17 | 0.50 | ppth | NaN | 0.5 Practical_Salinity_Units |
19 | 3.3 | ppth | NaN | 3.3 Practical_Salinity_Units |
... | ... | ... | ... | ... |
398017 | 31.5583 | ppt | NaN | 31.5583 Practical_Salinity_Units |
398026 | .3046 | ppt | NaN | 0.3046 Practical_Salinity_Units |
398029 | 17.7888 | ppt | NaN | 17.7888 Practical_Salinity_Units |
398031 | 17.1252 | ppt | NaN | 17.1252 Practical_Salinity_Units |
398035 | .02 | ppt | NaN | 0.02 Practical_Salinity_Units |
68606 rows × 4 columns
Nitrogen
[51]:
# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'cm3/g' UNDEFINED UNIT for Nitrogen
warn("WARNING: " + problem)
-Usable results-
count 109.000000
mean 26.920174
std 160.257726
min 0.000700
25% 0.410000
50% 0.629000
75% 1.120000
max 1630.000000
dtype: float64
Unusable results: 4
Usable results with inferred units: 0
Results outside threshold (0.0 to 988.466532186079): 1
[52]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]
[52]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Nitrogen | |
---|---|---|---|---|
43125 | 0.3 | mg/L | NaN | 0.3 milligram / liter |
43334 | 0.36 | mg/L | NaN | 0.36 milligram / liter |
43483 | 0.33875 | mg/L | NaN | 0.33875 milligram / liter |
43606 | 0.53125 | mg/L | NaN | 0.53125 milligram / liter |
44139 | 135 | mg/kg | NaN | 135.00000000000003 milligram / liter |
... | ... | ... | ... | ... |
396834 | 18.69 | mg/l | NaN | 18.69 milligram / liter |
396841 | 16.18 | mg/l | NaN | 16.18 milligram / liter |
396842 | 18.99 | mg/l | NaN | 18.99 milligram / liter |
396845 | 18.72 | mg/l | NaN | 18.72 milligram / liter |
396847 | 17.61 | mg/l | NaN | 17.61 milligram / liter |
113 rows × 4 columns
Conductivity
[53]:
# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True)
-Usable results-
count 1818.000000
mean 17085.221414
std 16116.889030
min 0.040000
25% 130.000000
50% 16994.750000
75% 30306.650000
max 54886.200000
dtype: float64
Unusable results: 8
Usable results with inferred units: 0
Results outside threshold (0.0 to 113786.55559242623): 0
[54]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]
[54]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Conductivity | |
---|---|---|---|---|
8 | 19204.2 | umho/cm | NaN | 19204.2 microsiemens / centimeter |
50 | 222.3 | umho/cm | NaN | 222.3 microsiemens / centimeter |
232 | 102.8 | umho/cm | NaN | 102.8 microsiemens / centimeter |
394 | 11017.5 | umho/cm | NaN | 11017.5 microsiemens / centimeter |
746 | 32 | umho/cm | NaN | 32.0 microsiemens / centimeter |
... | ... | ... | ... | ... |
397047 | 110 | umho/cm | NaN | 110.0 microsiemens / centimeter |
397054 | 65 | umho/cm | NaN | 65.0 microsiemens / centimeter |
397055 | 110 | umho/cm | NaN | 110.0 microsiemens / centimeter |
397058 | 390 | umho/cm | NaN | 390.0 microsiemens / centimeter |
397061 | 65 | umho/cm | NaN | 65.0 microsiemens / centimeter |
1826 rows × 4 columns
Chlorophyll a
[55]:
# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'None' UNDEFINED UNIT for Chlorophyll
warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/cm2' UNDEFINED UNIT for Chlorophyll
warn("WARNING: " + problem)
-Usable results-
count 9251.000000
mean 1.171505
std 1.200192
min -0.840000
25% 0.008395
50% 0.970000
75% 1.850000
max 9.990000
dtype: float64
Unusable results: 574
Usable results with inferred units: 0
Results outside threshold (0.0 to 8.372654750450925): 8
[56]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]
[56]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Chlorophyll | |
---|---|---|---|---|
264 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
551 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
670 | 2.3 | mg/m3 | NaN | 0.0023000000000000004 milligram / liter |
1230 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
1370 | 2.9 | mg/m3 | NaN | 0.0029000000000000007 milligram / liter |
... | ... | ... | ... | ... |
397850 | 6.3 | mg/m3 | NaN | 0.006300000000000001 milligram / liter |
397859 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
397875 | 1.1 | mg/m3 | NaN | 0.0011000000000000003 milligram / liter |
397976 | 4.4 | mg/m3 | NaN | 0.004400000000000001 milligram / liter |
397978 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
9825 rows × 4 columns
Organic Carbon
[57]:
# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True)
-Usable results-
count 4639.000000
mean 1184.028265
std 11819.016505
min 0.000000
25% 2.700000
50% 4.300000
75% 8.400000
max 410000.000000
dtype: float64
Unusable results: 163
Usable results with inferred units: 0
Results outside threshold (0.0 to 72098.12729220463): 22
[58]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]
[58]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Carbon | |
---|---|---|---|---|
127 | 2.6 | mg/L | NaN | 2.6 milligram / liter |
178 | 5.2 | mg/L | NaN | 5.2 milligram / liter |
218 | 3.9 | mg/L | NaN | 3.9 milligram / liter |
296 | 1.0 | mg/L | NaN | 1.0 milligram / liter |
315 | 2.6 | mg/L | NaN | 2.6 milligram / liter |
... | ... | ... | ... | ... |
397818 | 5.439 | mg/L | NaN | 5.439 milligram / liter |
397926 | 5.347 | mg/L | NaN | 5.347 milligram / liter |
397946 | 3.424 | mg/L | NaN | 3.424 milligram / liter |
398001 | 6.301 | mg/L | NaN | 6.301 milligram / liter |
398024 | 4.297 | mg/L | NaN | 4.297 milligram / liter |
4802 rows × 4 columns
Turbidity
[59]:
# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True)
-Usable results-
count 38014.000000
mean 9.049020
std 210.608888
min -0.840000
25% 1.430000
50% 2.510000
75% 4.810000
max 32342.452300
dtype: float64
Unusable results: 167
Usable results with inferred units: 10
Results outside threshold (0.0 to 1272.7023456013437): 45
[60]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]
[60]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Turbidity | |
---|---|---|---|---|
14 | 0 | NTU | NaN | 0.0 Nephelometric_Turbidity_Units |
54 | 5.6 | NTU | NaN | 5.6 Nephelometric_Turbidity_Units |
56 | 28 | NTU | NaN | 28.0 Nephelometric_Turbidity_Units |
90 | 1.4 | NTU | NaN | 1.4 Nephelometric_Turbidity_Units |
92 | 4.7 | NTU | NaN | 4.7 Nephelometric_Turbidity_Units |
... | ... | ... | ... | ... |
397925 | 2.6 | NTU | NaN | 2.6 Nephelometric_Turbidity_Units |
397943 | 1.7 | NTU | NaN | 1.7 Nephelometric_Turbidity_Units |
397970 | 3.9 | NTU | NaN | 3.9 Nephelometric_Turbidity_Units |
397989 | 2.1 | NTU | NaN | 2.1 Nephelometric_Turbidity_Units |
397990 | 7.9 | NTU | NaN | 7.9 Nephelometric_Turbidity_Units |
38181 rows × 4 columns
Sediment
[61]:
# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False)
[62]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]
[62]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Sediment |
---|
Phosphorus
Note: must be merged w/ activities (package runs query by site if not already merged)
[63]:
# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus')
2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"
Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.
[64]:
# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]
[64]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
20 | .061 | mg/L | NaN | NaN |
96 | 0.03 | mg/L | NaN | NaN |
147 | .13 | mg/L | NaN | NaN |
161 | 0.003 | mg/L | NaN | NaN |
360 | 0.002 | mg/L | NaN | NaN |
... | ... | ... | ... | ... |
397821 | .487 | mg/L | NaN | NaN |
397825 | .04 | mg/L | NaN | NaN |
397835 | .091 | mg/L | NaN | NaN |
397977 | .151 | mg/L | NaN | NaN |
398007 | .107 | mg/L | NaN | NaN |
5730 rows × 4 columns
[65]:
# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]
[65]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
20 | .061 | mg/L | NaN | NaN |
96 | 0.03 | mg/L | NaN | NaN |
147 | .13 | mg/L | NaN | NaN |
161 | 0.003 | mg/L | NaN | NaN |
360 | 0.002 | mg/L | NaN | NaN |
... | ... | ... | ... | ... |
397821 | .487 | mg/L | NaN | NaN |
397825 | .04 | mg/L | NaN | NaN |
397835 | .091 | mg/L | NaN | NaN |
397977 | .151 | mg/L | NaN | NaN |
398007 | .107 | mg/L | NaN | NaN |
5000 rows × 4 columns
[66]:
# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]
[66]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
3657 | 0.019 | mg/L | NaN | 0.019 milligram / liter |
8220 | 0.002 | mg/L | NaN | 0.002 milligram / liter |
14068 | 0.003 | mg/L | NaN | 0.003 milligram / liter |
17235 | 0.019 | mg/L | NaN | 0.019 milligram / liter |
56247 | 0.002 | mg/L | NaN | 0.002 milligram / liter |
58089 | 0.017 | mg/L | NaN | 0.017 milligram / liter |
73461 | 0.021 | mg/L | NaN | 0.021 milligram / liter |
76250 | 0.003 | mg/L | NaN | 0.003 milligram / liter |
101142 | 0.020 | mg/L | NaN | 0.02 milligram / liter |
106490 | 0.002 | mg/L | NaN | 0.002 milligram / liter |
155964 | 0.00806 | mg/L | NaN | 0.00806 milligram / liter |
159273 | 0.000031 | mg/L | NaN | 3.1e-05 milligram / liter |
160616 | 0.002542 | mg/L | NaN | 0.002542 milligram / liter |
161132 | 0.00341 | mg/L | NaN | 0.00341 milligram / liter |
200480 | 0.00372 | mg/L | NaN | 0.00372 milligram / liter |
202494 | 0.00961 | mg/L | NaN | 0.00961 milligram / liter |
203463 | 0.00124 | mg/L | NaN | 0.00124 milligram / liter |
204221 | 0.01271 | mg/L | NaN | 0.01271 milligram / liter |
395702 | 0.030 | mg/l as P | NaN | 0.03 milligram / liter |
395711 | 0.033 | mg/l as P | NaN | 0.033 milligram / liter |
395714 | 0.024 | mg/l as P | NaN | 0.024 milligram / liter |
395720 | 0.028 | mg/l as P | NaN | 0.028 milligram / liter |
395729 | 0.021 | mg/l as P | NaN | 0.021 milligram / liter |
395734 | 0.023 | mg/l as P | NaN | 0.023 milligram / liter |
395746 | 0.037 | mg/l as P | NaN | 0.037 milligram / liter |
395811 | 0.023 | mg/l as P | NaN | 0.023 milligram / liter |
395819 | 0.02 | mg/l as P | NaN | 0.02 milligram / liter |
395835 | 0.04 | mg/l as P | NaN | 0.04 milligram / liter |
395850 | 0.03 | mg/l as P | NaN | 0.03 milligram / liter |
395859 | 0.025 | mg/l as P | NaN | 0.025 milligram / liter |
395886 | 0.05 | mg/l as P | NaN | 0.05 milligram / liter |
395895 | 0.15 | mg/l as P | NaN | 0.15 milligram / liter |
395915 | 0.03 | mg/l as P | NaN | 0.03 milligram / liter |
396032 | 0.02 | mg/l as P | NaN | 0.02 milligram / liter |
396054 | 0.07 | mg/l as P | NaN | 0.07 milligram / liter |
396062 | 0.08 | mg/l as P | NaN | 0.08 milligram / liter |
396077 | 0.02 | mg/l as P | NaN | 0.02 milligram / liter |
396097 | 0.02 | mg/l as P | NaN | 0.02 milligram / liter |
396109 | 0.04 | mg/l as P | NaN | 0.04 milligram / liter |
396129 | 0.02 | mg/l as P | NaN | 0.02 milligram / liter |
396142 | 0.05 | mg/l as P | NaN | 0.05 milligram / liter |
396416 | 0.03 | mg/l as P | NaN | 0.03 milligram / liter |
396423 | 0.05 | mg/l as P | NaN | 0.05 milligram / liter |
[67]:
# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]
[67]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
25162 | .5 | mg/L | NaN | NaN |
25272 | .036 | mg/L | NaN | NaN |
26600 | .089 | mg/L | NaN | NaN |
27626 | .017 | mg/L | NaN | NaN |
28729 | .067 | mg/L | NaN | NaN |
... | ... | ... | ... | ... |
397017 | .18 | mg/L | NaN | NaN |
397028 | .25 | mg/L | NaN | NaN |
397040 | .16 | mg/L | NaN | NaN |
397043 | .18 | mg/L | NaN | NaN |
397052 | .31 | mg/L | NaN | NaN |
687 rows × 4 columns
Bacteria
Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.
Fecal Coliform
[68]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 10035.000000
mean 45.537618
std 448.839329
min 0.000000
25% 4.000000
50% 8.000000
75% 33.000000
max 33000.000000
dtype: float64
Unusable results: 40551
Usable results with inferred units: 0
Results outside threshold (0.0 to 2738.5735941387825): 6
[69]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]
[69]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Fecal_Coliform | |
---|---|---|---|---|
1 | 80 | cfu/100mL | NaN | NaN |
3 | 2 | MPN/100mL | NaN | NaN |
5 | *Non-detect | NaN | ResultMeasureValue: "*Non-detect" result canno... | NaN |
9 | *Non-detect | NaN | ResultMeasureValue: "*Non-detect" result canno... | NaN |
28 | 40 | cfu/100mL | NaN | NaN |
... | ... | ... | ... | ... |
396739 | 145 | cfu/100mL | NaN | NaN |
396765 | 317 | cfu/100mL | NaN | NaN |
396787 | 60 | cfu/100mL | NaN | NaN |
396912 | 600 | cfu/100mL | NaN | NaN |
396914 | 245 | cfu/100mL | NaN | NaN |
50586 rows × 4 columns
Escherichia coli
[70]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 22.000000
mean 501.863636
std 610.053260
min 4.000000
25% 9.500000
50% 77.500000
75% 1000.000000
max 1700.000000
dtype: float64
Unusable results: 7974
Usable results with inferred units: 0
Results outside threshold (0.0 to 4162.183198738116): 0
[71]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]
[71]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | E_coli | |
---|---|---|---|---|
11 | 0 | cfu/100mL | NaN | NaN |
37 | 1000 | cfu/100mL | NaN | NaN |
45 | 0 | cfu/100mL | NaN | NaN |
99 | 33.3333333333333 | cfu/100mL | NaN | NaN |
136 | 0 | cfu/100mL | NaN | NaN |
... | ... | ... | ... | ... |
397811 | 28 | MPN/100mL | NaN | NaN |
397826 | 390 | MPN/100mL | NaN | NaN |
397838 | 150 | MPN/100mL | NaN | NaN |
397898 | 150 | MPN/100mL | NaN | NaN |
397907 | 53 | MPN/100mL | NaN | NaN |
7996 rows × 4 columns
Combining Salinity and Conductivity
Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).
Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.
Explore Salinity results:
[72]:
from harmonize_wq import convert
[73]:
# Salinity summary statistics
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))
Range: 0.0 to 37782.0
Results: 68190
Mean: 16.236780567532055 PSU
[74]:
# Identify extreme outliers
[x for x in lst if x >3200]
[74]:
[15030.0, 37782.0]
Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration
[75]:
# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
[76]:
# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]
[76]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | |
---|---|---|---|---|
12623 | 15030 | ppt | NaN | 15030.0 Practical_Salinity_Units |
22079 | 322 | ppth | NaN | 322.0 Practical_Salinity_Units |
42633 | 2150 | ppth | NaN | 2150.0 Practical_Salinity_Units |
69507 | 37782 | ppth | NaN | 37782.0 Practical_Salinity_Units |
142665 | 2190 | ppt | NaN | 2190.0 Practical_Salinity_Units |
Detection limits may help understand what caused low values and what results might need to be dropped or updated
[77]:
from harmonize_wq import wrangle
[78]:
df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
'DetectionQuantitationLimitTypeName',
'DetectionQuantitationLimitMeasure/MeasureValue',
'DetectionQuantitationLimitMeasure/MeasureUnitCode']
[79]:
# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]
[79]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | ResultDetectionConditionText | DetectionQuantitationLimitTypeName | DetectionQuantitationLimitMeasure/MeasureValue | DetectionQuantitationLimitMeasure/MeasureUnitCode | |
---|---|---|---|---|---|---|---|---|
2209 | 0.00 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
3786 | 0.00 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
9784 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
18078 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
32610 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
307362 | 0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
307368 | 0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
345344 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
345565 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
360962 | 0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
2324 rows × 8 columns
Explore Conductivity results:
[80]:
# Create series and inspect Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series
[80]:
8 19204.2 microsiemens / centimeter
50 222.3 microsiemens / centimeter
232 102.8 microsiemens / centimeter
394 11017.5 microsiemens / centimeter
746 32.0 microsiemens / centimeter
...
397047 110.0 microsiemens / centimeter
397054 65.0 microsiemens / centimeter
397055 110.0 microsiemens / centimeter
397058 390.0 microsiemens / centimeter
397061 65.0 microsiemens / centimeter
Name: Conductivity, Length: 1818, dtype: object
Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters
[81]:
# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]
[81]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | Conductivity | |
---|---|---|---|---|---|
129171 | 54886.2 | umho/cm | NaN | NaN | 54886.2 microsiemens / centimeter |
132518 | 54871.3 | umho/cm | NaN | NaN | 54871.3 microsiemens / centimeter |
125816 | 54860.6 | umho/cm | NaN | NaN | 54860.6 microsiemens / centimeter |
131249 | 54859.3 | umho/cm | NaN | NaN | 54859.3 microsiemens / centimeter |
125215 | 54850.8 | umho/cm | NaN | NaN | 54850.8 microsiemens / centimeter |
... | ... | ... | ... | ... | ... |
86435 | 6.8 | umho/cm | NaN | NaN | 6.8 microsiemens / centimeter |
51516 | 2 | umho/cm | NaN | NaN | 2.0 microsiemens / centimeter |
142803 | 2 | umho/cm | NaN | NaN | 2.0 microsiemens / centimeter |
36855 | 1 | umho/cm | NaN | NaN | 1.0 microsiemens / centimeter |
143983 | .04 | umho/cm | NaN | NaN | 0.04 microsiemens / centimeter |
1818 rows × 5 columns
[82]:
# Check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.loc[df['Conductivity'].notna(), cols]
[82]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | Conductivity | |
---|---|---|---|---|---|
129171 | 54886.2 | umho/cm | NaN | NaN | 54886.2 microsiemens / centimeter |
132518 | 54871.3 | umho/cm | NaN | NaN | 54871.3 microsiemens / centimeter |
125816 | 54860.6 | umho/cm | NaN | NaN | 54860.6 microsiemens / centimeter |
131249 | 54859.3 | umho/cm | NaN | NaN | 54859.3 microsiemens / centimeter |
125215 | 54850.8 | umho/cm | NaN | NaN | 54850.8 microsiemens / centimeter |
... | ... | ... | ... | ... | ... |
86435 | 6.8 | umho/cm | NaN | NaN | 6.8 microsiemens / centimeter |
51516 | 2 | umho/cm | NaN | NaN | 2.0 microsiemens / centimeter |
142803 | 2 | umho/cm | NaN | NaN | 2.0 microsiemens / centimeter |
36855 | 1 | umho/cm | NaN | NaN | 1.0 microsiemens / centimeter |
143983 | .04 | umho/cm | NaN | NaN | 0.04 microsiemens / centimeter |
1818 rows × 5 columns
[83]:
# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str) # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']
[83]:
129171 36.356 dimensionless
132518 36.345 dimensionless
125816 36.338 dimensionless
131249 36.336 dimensionless
125215 36.33 dimensionless
...
86435 0.013 dimensionless
51516 0.012 dimensionless
142803 0.012 dimensionless
36855 0.012 dimensionless
143983 0.012 dimensionless
Name: Salinity, Length: 1818, dtype: object
Datetime
datetime() formats time using dataretrieval and ActivityStart
[84]:
# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]
[84]:
ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | |
---|---|---|---|
129171 | 2007-08-09 | 12:15:00 | CST |
132518 | 2007-08-09 | 12:15:00 | CST |
125816 | 2007-08-09 | 12:15:00 | CST |
131249 | 2007-08-09 | 12:15:00 | CST |
125215 | 2007-08-09 | 12:15:00 | CST |
... | ... | ... | ... |
398031 | 2024-03-13 | 08:20:00 | CDT |
398032 | 2024-03-06 | 07:45:00 | CDT |
398033 | 2024-03-06 | 07:45:00 | CDT |
398034 | 2024-03-13 | 08:20:00 | CDT |
398035 | 2024-03-06 | 11:30:00 | CDT |
398036 rows × 3 columns
[85]:
# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 35999 incomplete dates found, consider setting datetime_index to False.
warnings.warn(
[85]:
ActivityStartDate | Activity_datetime | |
---|---|---|
129171 | 2007-08-09 | 2007-08-09 18:15:00+00:00 |
132518 | 2007-08-09 | 2007-08-09 18:15:00+00:00 |
125816 | 2007-08-09 | 2007-08-09 18:15:00+00:00 |
131249 | 2007-08-09 | 2007-08-09 18:15:00+00:00 |
125215 | 2007-08-09 | 2007-08-09 18:15:00+00:00 |
... | ... | ... |
398031 | 2024-03-13 | 2024-03-13 13:20:00+00:00 |
398032 | 2024-03-06 | 2024-03-06 12:45:00+00:00 |
398033 | 2024-03-06 | 2024-03-06 12:45:00+00:00 |
398034 | 2024-03-13 | 2024-03-13 13:20:00+00:00 |
398035 | 2024-03-06 | 2024-03-06 16:30:00+00:00 |
398036 rows × 2 columns
Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.
Depth
Note: Data are often lacking sample depth metadata
[86]:
# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()
[86]:
1618 7.0
3651 7.0
73415 0.1
73452 2.2
73526 2.0
...
82807 2.2
87971 1.0
88441 16.0
88845 16.0
142158 35.0
Name: ResultDepthHeightMeasure/MeasureValue, Length: 179, dtype: float64
Characteristic to Column (long to wide format)
[87]:
# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2
[87]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | QA_pH | QA_Conductivity | QA_E_coli | QA_DO | QA_Turbidity | QA_Secchi | QA_Chlorophyll | QA_Nitrogen | QA_Temperature | QA_Carbon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
129171 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230231_173 | 2007-08-09 | 12:15:00 | -0600 | 21AWIC-1122 | STORET-170383613 | 230231.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
132518 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230230_173 | 2007-08-09 | 12:15:00 | -0600 | 21AWIC-1122 | STORET-170383607 | 230230.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
125816 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230228_173 | 2007-08-09 | 12:15:00 | -0600 | 21AWIC-1122 | STORET-170383595 | 230228.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
131249 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230229_173 | 2007-08-09 | 12:15:00 | -0600 | 21AWIC-1122 | STORET-170383601 | 230229.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
125215 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230227_173 | 2007-08-09 | 12:15:00 | -0600 | 21AWIC-1122 | STORET-170383589 | 230227.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
398031 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335811_1873104_173 | 2024-03-13 | 08:20:00 | -0500 | 21AWIC-9768 | STORET-1039487062 | 1873104.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
398032 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335642_1872423_173 | 2024-03-06 | 07:45:00 | -0500 | 21AWIC-1208 | STORET-1039486410 | 1872423.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
398033 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335642_1872426_173 | 2024-03-06 | 07:45:00 | -0500 | 21AWIC-1208 | STORET-1039486431 | 1872426.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
398034 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335811_1873106_173 | 2024-03-13 | 08:20:00 | -0500 | 21AWIC-9768 | STORET-1039487073 | 1873106.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
398035 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-335728_1874876_173 | 2024-03-06 | 11:30:00 | -0500 | 21AWIC-1152 | STORET-1039490686 | 1874876.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
347782 rows × 117 columns
[88]:
# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))
14 new columns
[89]:
# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))
50254 fewer rows
[90]:
#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]
[90]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | Carbon | QA_Carbon |
---|
Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)
[91]:
# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)
[92]:
# Columns still in main table
main_df.columns
[92]:
Index(['OrganizationIdentifier', 'OrganizationFormalName',
'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
'E_coli', 'DetectionQuantitationLimitTypeName',
'DetectionQuantitationLimitMeasure/MeasureValue',
'DetectionQuantitationLimitMeasure/MeasureUnitCode',
'Activity_datetime', 'Depth', 'QA_TP_Phosphorus', 'QA_TDP_Phosphorus',
'QA_Other_Phosphorus', 'QA_Salinity', 'QA_Fecal_Coliform', 'QA_pH',
'QA_Conductivity', 'QA_E_coli', 'QA_DO', 'QA_Turbidity', 'QA_Secchi',
'QA_Chlorophyll', 'QA_Nitrogen', 'QA_Temperature', 'QA_Carbon'],
dtype='object')
[93]:
# look at main table results (first 5)
main_df.head()
[93]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | MonitoringLocationIdentifier | ProviderName | Secchi | Temperature | DO | pH | Salinity | ... | QA_pH | QA_Conductivity | QA_E_coli | QA_DO | QA_Turbidity | QA_Secchi | QA_Chlorophyll | QA_Nitrogen | QA_Temperature | QA_Carbon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
129171 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230231_173 | 21AWIC-1122 | STORET | NaN | NaN | NaN | NaN | 36.356 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
132518 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230230_173 | 21AWIC-1122 | STORET | NaN | NaN | NaN | NaN | 36.345 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
125816 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230228_173 | 21AWIC-1122 | STORET | NaN | NaN | NaN | NaN | 36.338 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
131249 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230229_173 | 21AWIC-1122 | STORET | NaN | NaN | NaN | NaN | 36.336 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
125215 | 21AWIC | ALABAMA DEPT. OF ENVIRONMENTAL MANAGEMENT - WA... | 21AWIC-51908_230227_173 | 21AWIC-1122 | STORET | NaN | NaN | NaN | NaN | 36.33 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 44 columns
[94]:
# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]
[94]:
['Sediment',
'QA_TP_Phosphorus',
'QA_TDP_Phosphorus',
'QA_Other_Phosphorus',
'QA_Fecal_Coliform',
'QA_Conductivity',
'QA_E_coli',
'QA_Secchi',
'QA_Carbon']
[95]:
# Map average results at each station
gdf_avg = visualize.map_measure(main_df, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[95]:
<Axes: >