Cape Cod - Detailed step-by-step
Standardize, clean and wrangle Water Quality Portal data in Cape Cod into more analytic-ready formats using the harmonize_wq package
US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help: - Identify differences in data units (including speciation and basis) - Identify differences in sampling or analytic methods - Resolve data errors using transparent assumptions - Reduce data to the columns that are most commonly needed - Transform data from long to wide format
Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
Detailed step-by-step workflow
This example workflow takes a deeper dive into some of the expanded functionality to examine results for different water quality parameters in Cape Cod
Install and import the required libraries
[1]:
import sys
#! python -m pip uninstall harmonize-wq --yes
# Use pip to install the package from pypi or the latest from github
#!{sys.executable} -m pip install harmonize-wq
# For latest dev version
#!{sys.executable} -m pip install git+https://github.com/USEPA/harmonize-wq.git
[2]:
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
from harmonize_wq import location
from harmonize_wq import harmonize
from harmonize_wq import visualize
from harmonize_wq import clean
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/nadp.py:44: UserWarning: GDAL not installed. Some functions will not work.
warnings.warn('GDAL not installed. Some functions will not work.')
Download location data using dataretrieval
[3]:
# Read geometry for Area of Interest from geojson file url and plot
aoi_url = 'https://github.com/jbousquin/test_notebook/raw/main/temperature_data/NewEngland.geojson'
aoi_gdf = wrangle.as_gdf(aoi_url) # Already 4326 standard
aoi_gdf.plot()
[3]:
<Axes: >
[4]:
# Build query with characteristicNames and the AOI extent
query = {'characteristicName': ['Phosphorus',
'Temperature, water',
'Depth, Secchi disk depth',
'Dissolved oxygen (DO)',
'Salinity',
'pH',
'Nitrogen',
'Conductivity',
'Organic carbon',
'Chlorophyll a',
'Turbidity',
'Sediment',
'Fecal Coliform',
'Escherichia coli']}
query['bBox'] = wrangle.get_bounding_box(aoi_gdf)
[5]:
# Query stations (can be slow)
stations, site_md = wqp.what_sites(**query)
[6]:
# Rows and columns for results
stations.shape
[6]:
(10580, 37)
[7]:
# First 5 rows
stations.head()
[7]:
OrganizationIdentifier | OrganizationFormalName | MonitoringLocationIdentifier | MonitoringLocationName | MonitoringLocationTypeName | MonitoringLocationDescriptionText | HUCEightDigitCode | DrainageAreaMeasure/MeasureValue | DrainageAreaMeasure/MeasureUnitCode | ContributingDrainageAreaMeasure/MeasureValue | ... | AquiferName | LocalAqfrName | FormationTypeText | AquiferTypeName | ConstructionDateText | WellDepthMeasure/MeasureValue | WellDepthMeasure/MeasureUnitCode | WellHoleDepthMeasure/MeasureValue | WellHoleDepthMeasure/MeasureUnitCode | ProviderName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | USGS-MA | USGS Massachusetts Water Science Center | USGS-010965305 | MERRIMACK R NR TYNGSBOROUGH BRIDGE TYNGSBOROUG... | Stream | NaN | 1070006.0 | 4070.00 | sq mi | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
1 | USGS-MA | USGS Massachusetts Water Science Center | USGS-01096544 | STONY BROOK AT SCHOOL STREET AT CHELMSFORD, MA | Stream | NaN | 1070006.0 | 41.57 | sq mi | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
2 | USGS-MA | USGS Massachusetts Water Science Center | USGS-01096546 | STONY BROOK AT CHELMSFORD, MA | Stream | NaN | 1070006.0 | 43.60 | sq mi | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
3 | USGS-MA | USGS Massachusetts Water Science Center | USGS-01096548 | STONY BROOK AT N CHELMSFORD, MA | Stream | NaN | 1070006.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
4 | USGS-MA | USGS Massachusetts Water Science Center | USGS-01096550 | MERRIMACK RIVER ABOVE LOWELL, MA | Stream | NaN | 1070006.0 | 3900.00 | sq mi | 3900.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
5 rows × 37 columns
[8]:
# Columns used for an example row
stations.iloc[0][['HorizontalCoordinateReferenceSystemDatumName', 'LatitudeMeasure', 'LongitudeMeasure']]
[8]:
HorizontalCoordinateReferenceSystemDatumName NAD83
LatitudeMeasure 42.677389
LongitudeMeasure -71.421056
Name: 0, dtype: object
[9]:
# Harmonize location datums to 4326 (Note we keep intermediate columns using intermediate_columns=True)
stations_gdf = location.harmonize_locations(stations, outEPSG=4326, intermediate_columns=True)
[10]:
# Every function has a dostring to help understand input/output and what it does
location.harmonize_locations?
[11]:
# Rows and columns for results after running the function (5 new columns, only 2 new if intermediate_columns=False)
stations_gdf.shape
[11]:
(10580, 42)
[12]:
# Example results for the new columns
stations_gdf.iloc[0][['geom_orig', 'EPSG', 'QA_flag', 'geom', 'geometry']]
[12]:
geom_orig (-71.4210556, 42.67738889)
EPSG 4269.0
QA_flag NaN
geom POINT (-71.4210556 42.67738889)
geometry POINT (-71.4210556 42.67738889)
Name: 0, dtype: object
[13]:
# geom and geometry look the same but geometry is a special datatype
stations_gdf['geometry'].dtype
[13]:
<geopandas.array.GeometryDtype at 0x7f396a1c28d0>
[14]:
# Look at the different QA_flag flags that have been assigned,
# e.g., for bad datums or limited decimal precision
set(stations_gdf.loc[stations_gdf['QA_flag'].notna()]['QA_flag'])
[14]:
{'HorizontalCoordinateReferenceSystemDatumName: Bad datum OTHER, EPSG:4326 assumed',
'HorizontalCoordinateReferenceSystemDatumName: Bad datum UNKWN, EPSG:4326 assumed',
'LatitudeMeasure: Imprecise: lessthan3decimaldigits',
'LatitudeMeasure: Imprecise: lessthan3decimaldigits; LongitudeMeasure: Imprecise: lessthan3decimaldigits',
'LongitudeMeasure: Imprecise: lessthan3decimaldigits'}
[15]:
# Map it
stations_gdf.plot()
[15]:
<Axes: >
[16]:
# Clip to area of interest
stations_clipped = wrangle.clip_stations(stations_gdf, aoi_gdf)
[17]:
# Map it
stations_clipped.plot()
[17]:
<Axes: >
[18]:
# How many stations now?
len(stations_clipped)
[18]:
1519
[19]:
# To save the results to a shapefile
#import os
#path = '' #specify the path (folder/directory) to save it to
#stations_clipped.to_file(os.path.join(path,'CapeCod_stations.shp'))
Retrieve Characteristic Data
[20]:
# Now query for results
query['dataProfile'] = 'narrowResult'
res_narrow, md_narrow = wqp.get_results(**query)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/wqp.py:83: DtypeWarning: Columns (8,10,13,15,17,19,20,21,22,23,28,31,32,33,34,36,38,60,63,64,65,66,67,68,69,70,71,72) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(StringIO(response.text), delimiter=',')
[21]:
df = res_narrow
df
[21]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | AnalysisEndTime/TimeZoneCode | ResultLaboratoryCommentCode | ResultLaboratoryCommentText | ResultDetectionQuantitationLimitUrl | LaboratoryAccreditationIndicator | LaboratoryAccreditationAuthorityName | TaxonomistAccreditationIndicator | TaxonomistAccreditationAuthorityName | LabSamplePreparationUrl | ProviderName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CRWA | Charles River Watershed Association (Massachus... | CRWA-CYN20130809ROBTemp01 | 2013-08-09 | 11:14:33 | EST | CRWA-ROB | STORET-591631481 | 130809111433.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
1 | 11NPSWRD_WQX | National Park Service Water Resources Division | 11NPSWRD_WQX-CACO_HX30_21_7/22/2013_SFW_0.01 | 2013-07-22 | 11:01:00 | EDT | 11NPSWRD_WQX-CACO_HX30_21 | STORET-986369728 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
2 | MASSDEP | Massachusetts Department of Environmental Prot... | MASSDEP-LB-5142 | 2013-07-30 | 14:20:00 | EDT | MASSDEP-W2173 | STORET-762955917 | NaN | NaN | ... | NaN | NaN | NaN | https://www.waterqualitydata.us/data/providers... | NaN | NaN | NaN | NaN | NaN | STORET |
3 | IRWA | Ipswich River Watershed Association (Volunteer) | IRWA-HB:20130630120000:FM | 2013-06-30 | 12:00:00 | EDT | IRWA-HB | STORET-853064665 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
4 | CRWA | Charles River Watershed Association (Massachus... | CRWA-VMM20131217609SEC02 | 2013-12-17 | 07:40:00 | EST | CRWA-609S | STORET-872379847 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STORET |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
416607 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400495 | 2024-01-08 | 12:00:00 | EST | USGS-01098530 | NWIS-126836828 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
416608 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | EST | USGS-01097050 | NWIS-126989046 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
416609 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | EST | USGS-01097050 | NWIS-126989056 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
416610 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | EST | USGS-01097050 | NWIS-126989063 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
416611 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | EST | USGS-01097050 | NWIS-126989064 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NWIS |
416612 rows × 78 columns
[22]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(df, stations_clipped)
legend_kwds = {"fmt": "{:.0f}", 'bbox_to_anchor':(1, 0.75)}
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[22]:
<Axes: >
Harmonize Characteristic Results
Two options for functions to harmonize characteristics: harmonize_all() or harmonize_generic(). harmonize_all runs functions on all characteristics and lets you specify how to handle errors harmonize_generic runs functions only on the characteristic specified with char_val and lets you also choose output units, to keep intermediate columns and to do a quick report summarizing changes.
[23]:
# See Documentation
#harmonize_WQP.harmonize_all?
#harmonize_WQP.harmonize?
secchi disk depth
[24]:
# Each harmonize function has optional params, e.g., char_val is the characticName column value to use so we can send the entire df.
# Optional params: units='m', char_val='Depth, Secchi disk depth', out_col='Secchi', report=False)
# We start by demonstrating on secchi disk depth (units default to m, keep intermediate fields, see report)
df = harmonize.harmonize(df, "Depth, Secchi disk depth", errors="ignore", intermediate_columns=True, report=True)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'None' UNDEFINED UNIT for Secchi
warn("WARNING: " + problem)
-Usable results-
count 4859.000000
mean -12.392416
std 405.605559
min -9999.000000
25% 1.600000
50% 3.500000
75% 6.000000
max 17.000000
dtype: float64
Unusable results: 264
Usable results with inferred units: 0
Results outside threshold (0.0 to 2421.240935089463): 16
The threshold is based on standard deviations and is currently only used in the histogram.
[25]:
# Look at a table of just Secchi results and focus on subset of columns
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Units']
sechi_results = df.loc[df['CharacteristicName']=='Depth, Secchi disk depth', cols + ['Secchi']]
sechi_results
[25]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Units | Secchi | |
---|---|---|---|---|---|---|
17 | 11113300-GRTKINSD | 3.25 | m | NaN | m | 3.25 meter |
137 | 11113300-BEADERD | 4.25 | m | NaN | m | 4.25 meter |
311 | 11113300-ANGSDND | 3.5 | m | NaN | m | 3.5 meter |
577 | 11113300-GRTKINSD | 2.25 | m | NaN | m | 2.25 meter |
597 | 11113300-GRTKINSD | 4.75 | m | NaN | m | 4.75 meter |
... | ... | ... | ... | ... | ... | ... |
415922 | 11NPSWRD_WQX-CACO_DUCK_W | 16.5 | m | NaN | m | 16.5 meter |
415923 | 11NPSWRD_WQX-CACO_DUCK_W | 10.0 | m | NaN | m | 10.0 meter |
415924 | 11NPSWRD_WQX-CACO_DUCK_W | 17.0 | m | NaN | m | 17.0 meter |
415925 | 11NPSWRD_WQX-CACO_DUCK_W | 16.0 | m | NaN | m | 16.0 meter |
415926 | 11NPSWRD_WQX-CACO_GREAT_W | 7.8 | m | NaN | m | 7.8 meter |
5123 rows × 6 columns
[26]:
# Look at unusable(NAN) results
sechi_results.loc[df['Secchi'].isna()]
[26]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Units | Secchi | |
---|---|---|---|---|---|---|
18592 | NARS_WQX-NLA12_MA-102 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | m | NaN |
29771 | EPA_OCMA_R1-SS-2 | dark | m | ResultMeasureValue: "dark" result cannot be used | m | NaN |
32659 | EPA_OCMA_R1-SS-3 | dark | m | ResultMeasureValue: "dark" result cannot be used | m | NaN |
34650 | EPA_OCMA_R1-R1-23 | dark | m | ResultMeasureValue: "dark" result cannot be used | m | NaN |
36903 | EPA_OCMA_R1-R1-25 | dark | m | ResultMeasureValue: "dark" result cannot be used | m | NaN |
... | ... | ... | ... | ... | ... | ... |
328138 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
328218 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
328574 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
329325 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
415913 | 11NPSWRD_WQX-CACO_SLOUGH | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
264 rows × 6 columns
[27]:
# look at the QA flag for first row from above
list(sechi_results.loc[df['Secchi'].isna()]['QA_flag'])[0]
[27]:
'ResultMeasureValue: missing (NaN) result; ResultMeasure/MeasureUnitCode: MISSING UNITS, m assumed'
[28]:
# All cases where there was a QA flag
sechi_results.loc[df['QA_flag'].notna()]
[28]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Units | Secchi | |
---|---|---|---|---|---|---|
18592 | NARS_WQX-NLA12_MA-102 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | m | NaN |
29737 | NARS_WQX-NCCA10-1070 | -9 | None | ResultMeasure/MeasureUnitCode: 'None' UNDEFINE... | m | -9.0 meter |
29771 | EPA_OCMA_R1-SS-2 | dark | m | ResultMeasureValue: "dark" result cannot be used | m | NaN |
29918 | NARS_WQX-NCCA10-1029 | -9 | None | ResultMeasure/MeasureUnitCode: 'None' UNDEFINE... | m | -9.0 meter |
31135 | NARS_WQX-NCCA10-1001 | -9 | None | ResultMeasure/MeasureUnitCode: 'None' UNDEFINE... | m | -9.0 meter |
... | ... | ... | ... | ... | ... | ... |
328138 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
328218 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
328574 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
329325 | 11NPSWRD_WQX-CACO_GREAT_W | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
415913 | 11NPSWRD_WQX-CACO_SLOUGH | NaN | m | ResultMeasureValue: missing (NaN) result | m | NaN |
272 rows × 6 columns
If both value and unit are missing nothing can be done, a unitless (NaN) value is assumed as to be in default units but a QA_flag is added
[29]:
# Aggregate secchi data by station
visualize.station_summary(sechi_results, 'Secchi')
[29]:
MonitoringLocationIdentifier | cnt | mean | |
---|---|---|---|
0 | 11113300-ANGSDND | 42 | 3.796369 |
1 | 11113300-BEADERD | 42 | 3.644048 |
2 | 11113300-CANWIND | 22 | 6.144000 |
3 | 11113300-CAPSALD | 24 | 2.456262 |
4 | 11113300-COBWINND | 21 | 3.768810 |
... | ... | ... | ... |
689 | WWMD_VA-SH1 | 2 | 1.500000 |
690 | WWMD_VA-SH2 | 2 | 1.800000 |
691 | WWMD_VA-SR6A | 1 | 0.500000 |
692 | WWMD_VA-WF2 | 1 | 0.600000 |
693 | WWMD_VA-WR5 | 1 | 2.000000 |
694 rows × 3 columns
[30]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(sechi_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 5 classes. Setting k to 4.
self.bins = quantile(y, k=k)
[30]:
<Axes: >
[31]:
# Map average results at each station
gdf_avg = visualize.map_measure(sechi_results, stations_clipped, 'Secchi')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[31]:
<Axes: >
Temperature
The default error=’raise’, makes it so that there is an error when there is a dimensionality error (i.e. when units can’t be converted). Here we would get the error: DimensionalityError: Cannot convert from ‘count’ (dimensionless) to ‘degree_Celsius’ ([temperature])
[32]:
#'Temperature, water'
# errors=‘ignore’, invalid dimension conversions will return the NaN.
df = harmonize.harmonize(df, 'Temperature, water', intermediate_columns=True, report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 119742.000000
mean 16.041278
std 7.327751
min -6.000000
25% 11.000000
50% 16.730000
75% 21.683374
max 910.000000
dtype: float64
Unusable results: 213
Usable results with inferred units: 217
Results outside threshold (0.0 to 60.007781720935796): 116
[33]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Temperature', 'Units']
temperature_results = df.loc[df['CharacteristicName']=='Temperature, water', cols]
temperature_results
[33]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
0 | CRWA-ROB | 23.01 | deg C | NaN | 23.01 degree_Celsius | degC |
8 | CRWA-635S | 5.5 | deg C | NaN | 5.5 degree_Celsius | degC |
15 | CRWA-4LONG | 28.2 | deg C | NaN | 28.2 degree_Celsius | degC |
22 | BRC-C-02-03-040 | 15 | deg C | NaN | 15.0 degree_Celsius | degC |
26 | 11113300-BCHHSPHAMLF | 15 | deg C | NaN | 15.0 degree_Celsius | degC |
... | ... | ... | ... | ... | ... | ... |
416594 | USGS-413831070304703 | 11.3 | deg C | NaN | 11.3 degree_Celsius | degC |
416596 | USGS-01100627 | 7.0 | deg C | NaN | 7.0 degree_Celsius | degC |
416600 | USGS-01096548 | 2.9 | deg C | NaN | 2.9 degree_Celsius | degC |
416604 | USGS-01098530 | 1.5 | deg C | NaN | 1.5 degree_Celsius | degC |
416608 | USGS-01097050 | 3.9 | deg C | NaN | 3.9 degree_Celsius | degC |
119955 rows × 6 columns
[34]:
# Examine deg F
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'] == 'deg F']
[34]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
9748 | NALMS-F865245 | 81 | deg F | NaN | 27.222222222222285 degree_Celsius | degF |
22987 | 11113300-HOODERD | 59.2 | deg F | NaN | 15.111111111111143 degree_Celsius | degF |
23336 | 11113300-HOODERD | 59.1 | deg F | NaN | 15.0555555555556 degree_Celsius | degF |
23626 | 11113300-HOODERD | 59.9 | deg F | NaN | 15.500000000000057 degree_Celsius | degF |
24108 | 11113300-HOODERD | 56 | deg F | NaN | 13.333333333333371 degree_Celsius | degF |
... | ... | ... | ... | ... | ... | ... |
412393 | 11NPSWRD_WQX-SAIR_MEQ76_NC6 | 70.0 | deg F | NaN | 21.111111111111143 degree_Celsius | degF |
412399 | 11NPSWRD_WQX-SAMA_MEQ76_NC18 | 71.0 | deg F | NaN | 21.666666666666686 degree_Celsius | degF |
412402 | 11NPSWRD_WQX-SAMA_MEQ76_NC13 | 72.0 | deg F | NaN | 22.222222222222285 degree_Celsius | degF |
412417 | 11NPSWRD_WQX-SAIR_MEQ76_NC3 | 70.0 | deg F | NaN | 21.111111111111143 degree_Celsius | degF |
412418 | 11NPSWRD_WQX-SAMA_MEQ76_SB05 | 60.0 | deg F | NaN | 15.5555555555556 degree_Celsius | degF |
695 rows × 6 columns
In the above we can see examples where the results were in deg F and in the result field they’ve been converted into degree_Celsius
[35]:
# Examine missing units
temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna()]
[35]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
100462 | AQUINNAH-MEN TXCO | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
100463 | AQUINNAH-MEN TXCO | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
100464 | AQUINNAH-MEN TXCO | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
100465 | AQUINNAH-MEN TXCO | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
100466 | AQUINNAH-MEN TXCO | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
... | ... | ... | ... | ... | ... | ... |
101804 | AQUINNAH-MEN PND Q | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
101805 | AQUINNAH-MEN PND Q | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
101807 | AQUINNAH-MEN PND Q | 0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
145824 | AQUINNAH-MEN PND HC | 0.0 | NaN | ResultMeasure/MeasureUnitCode: MISSING UNITS, ... | 0.0 degree_Celsius | degC |
379548 | USGS-414654070002901 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN | degC |
218 rows × 6 columns
We can see where the units were missing, the results were assumed to be in degree_Celsius already
[36]:
# This is also noted in the QA_flag field
list(temperature_results.loc[df['ResultMeasure/MeasureUnitCode'].isna(), 'QA_flag'])[0]
[36]:
'ResultMeasure/MeasureUnitCode: MISSING UNITS, degC assumed'
[37]:
# Look for any without usable results
temperature_results.loc[df['Temperature'].isna()]
[37]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Temperature | Units | |
---|---|---|---|---|---|---|
269 | MERRIMACK_RIVER_WATERSHED_WQX-Hayden-Schofield | 8 | count | NaN | NaN | count |
2958 | MERRIMACK_RIVER_WATERSHED_WQX-Manchester | 8 | count | NaN | NaN | count |
3439 | MERRIMACK_RIVER_WATERSHED_WQX-Nina-Scarito | 8 | count | NaN | NaN | count |
4263 | MERRIMACK_RIVER_WATERSHED_WQX-Misserville | 8 | count | NaN | NaN | count |
4561 | MERRIMACK_RIVER_WATERSHED_WQX-Ferrous | 8 | count | NaN | NaN | count |
... | ... | ... | ... | ... | ... | ... |
247574 | 11NPSWRD_WQX-SAIR_SWC_SR3 | NaN | deg C | ResultMeasureValue: missing (NaN) result | NaN | degC |
254544 | 11NPSWRD_WQX-SAIR_SWC_SR6 | NaN | deg C | ResultMeasureValue: missing (NaN) result | NaN | degC |
255792 | 11NPSWRD_WQX-SAIR_SWC_SR7 | NaN | deg C | ResultMeasureValue: missing (NaN) result | NaN | degC |
260273 | 11NPSWRD_WQX-SAIR_SWC_SRT9 | NaN | deg C | ResultMeasureValue: missing (NaN) result | NaN | degC |
379548 | USGS-414654070002901 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN | degC |
213 rows × 6 columns
[38]:
# Aggregate temperature data by station
visualize.station_summary(temperature_results, 'Temperature')
[38]:
MonitoringLocationIdentifier | cnt | mean | |
---|---|---|---|
0 | 11113300-00-SPB | 21 | 10.909524 |
1 | 11113300-00F-KLY | 3 | 20.233333 |
2 | 11113300-01-BAK | 10 | 20.600000 |
3 | 11113300-01-BVR | 1 | 21.000000 |
4 | 11113300-01-CTP | 39 | 24.438462 |
... | ... | ... | ... |
5928 | WWMD_VA-WWE1149PI | 1 | 18.800000 |
5929 | WWMD_VA-WWE1150PI | 1 | 21.300000 |
5930 | WWMD_VA-WWE1151PI | 2 | 20.700000 |
5931 | WWMD_VA-WWE1152PI | 1 | 20.000000 |
5932 | WWMD_VA-WWE1153PI | 1 | 20.800000 |
5933 rows × 3 columns
[39]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(temperature_results, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[39]:
<Axes: >
[40]:
# Map average results at each station
gdf_avg = visualize.map_measure(temperature_results, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[40]:
<Axes: >
Dissolved Oxygen (DO)
[41]:
# look at Dissolved oxygen (DO), but this time without intermediate fields
df = harmonize.harmonize(df, 'Dissolved oxygen (DO)')
Note: Imediately when we run a harmonization function without the intermediate fields they’re deleted.
[42]:
# Look at what was changed
cols = ['MonitoringLocationIdentifier', 'ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'DO']
do_res = df.loc[df['CharacteristicName']=='Dissolved oxygen (DO)', cols]
do_res
[42]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | DO | |
---|---|---|---|---|---|
3 | IRWA-HB | 4.6 | mg/L | NaN | 4.6 milligram / liter |
6 | MYRWA-MEB001 | 9.4 | mg/L | NaN | 9.4 milligram / liter |
7 | NARS_WQX-MARO-1020 | 10.3 | mg/L | NaN | 10.3 milligram / liter |
9 | MASSDEP-W2412 | 8.7 | mg/L | NaN | 8.7 milligram / liter |
11 | MERRIMACK_RIVER_WATERSHED_WQX-Central Catholic | 10.91 | mg/L | NaN | 10.91 milligram / liter |
... | ... | ... | ... | ... | ... |
413047 | 11NPSWRD_WQX-CACO_GREAT_W | 6.1 | mg/L | NaN | 6.1 milligram / liter |
413050 | 11NPSWRD_WQX-SAMA_EPG_SHC | 14.2 | mg/L | NaN | 14.2 milligram / liter |
413051 | 11NPSWRD_WQX-SAMA_EPG_SHD | 8.5 | mg/L | NaN | 8.5 milligram / liter |
413057 | 11NPSWRD_WQX-SAMA_EPG_SHC | 9.1 | mg/L | NaN | 9.1 milligram / liter |
413061 | 11NPSWRD_WQX-SAMA_EPG_SHD | 9.2 | mg/L | NaN | 9.2 milligram / liter |
70434 rows × 5 columns
[43]:
do_res.loc[do_res['ResultMeasure/MeasureUnitCode']!='mg/l']
[43]:
MonitoringLocationIdentifier | ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | DO | |
---|---|---|---|---|---|
3 | IRWA-HB | 4.6 | mg/L | NaN | 4.6 milligram / liter |
6 | MYRWA-MEB001 | 9.4 | mg/L | NaN | 9.4 milligram / liter |
7 | NARS_WQX-MARO-1020 | 10.3 | mg/L | NaN | 10.3 milligram / liter |
9 | MASSDEP-W2412 | 8.7 | mg/L | NaN | 8.7 milligram / liter |
11 | MERRIMACK_RIVER_WATERSHED_WQX-Central Catholic | 10.91 | mg/L | NaN | 10.91 milligram / liter |
... | ... | ... | ... | ... | ... |
413047 | 11NPSWRD_WQX-CACO_GREAT_W | 6.1 | mg/L | NaN | 6.1 milligram / liter |
413050 | 11NPSWRD_WQX-SAMA_EPG_SHC | 14.2 | mg/L | NaN | 14.2 milligram / liter |
413051 | 11NPSWRD_WQX-SAMA_EPG_SHD | 8.5 | mg/L | NaN | 8.5 milligram / liter |
413057 | 11NPSWRD_WQX-SAMA_EPG_SHC | 9.1 | mg/L | NaN | 9.1 milligram / liter |
413061 | 11NPSWRD_WQX-SAMA_EPG_SHD | 9.2 | mg/L | NaN | 9.2 milligram / liter |
69137 rows × 5 columns
Though there were no results in %, the conversion from percent saturation (%) to mg/l is special. This equation is being improved by integrating tempertaure and pressure instead of assuming STP (see DO_saturation())
[44]:
# Aggregate data by station
visualize.station_summary(do_res, 'DO')
[44]:
MonitoringLocationIdentifier | cnt | mean | |
---|---|---|---|
0 | 11113300-00F-KLY | 3 | 4.040000 |
1 | 11113300-01-BAK | 10 | 5.697000 |
2 | 11113300-01-BVR | 1 | 6.700000 |
3 | 11113300-01-CTP | 37 | 6.945135 |
4 | 11113300-01-GOL | 1 | 4.400000 |
... | ... | ... | ... |
2323 | WWMD_VA-SR5 | 1 | 10.810000 |
2324 | WWMD_VA-WF2 | 2 | 6.800000 |
2325 | WWMD_VA-WI1 | 1 | 9.510000 |
2326 | WWMD_VA-WR2X | 2 | 6.000000 |
2327 | WWMD_VA-WR5 | 1 | 8.355000 |
2328 rows × 3 columns
[45]:
# Map number of usable results at each station
gdf_count = visualize.map_counts(do_res, stations_clipped)
gdf_count.plot(column='cnt', cmap='Blues', legend=True, scheme='quantiles', legend_kwds=legend_kwds)
[45]:
<Axes: >
[46]:
# Map average results at each station
gdf_avg = visualize.map_measure(do_res, stations_clipped, 'DO')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[46]:
<Axes: >
pH
[47]:
# pH, this time looking at a report
df = harmonize.harmonize(df, "pH", errors="ignore", report=True)
-Usable results-
count 94614.000000
mean 6.509265
std 2.641679
min -3.124705
25% 5.700000
50% 6.620000
75% 7.320000
max 715.000000
dtype: float64
Unusable results: 17
Usable results with inferred units: 215
Results outside threshold (0.0 to 22.35933967770454): 12
Note the warnings that occur when a unit is not recognized by the package. These occur even when report=False. Future versions could include these as defined units for pH, but here it wouldn’t alter results.
[48]:
df.loc[df['CharacteristicName']=='pH', ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'pH']]
[48]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | pH | |
---|---|---|---|---|
28 | 7.78 | None | NaN | 7.78 dimensionless |
52 | 7.9 | None | NaN | 7.9 dimensionless |
58 | 7.69 | None | NaN | 7.69 dimensionless |
59 | 6.16 | None | NaN | 6.16 dimensionless |
61 | 7.4 | None | NaN | 7.4 dimensionless |
... | ... | ... | ... | ... |
416595 | 5.8 | std units | NaN | 5.8 dimensionless |
416597 | 7.3 | std units | NaN | 7.3 dimensionless |
416601 | 7.4 | std units | NaN | 7.4 dimensionless |
416605 | 7.5 | std units | NaN | 7.5 dimensionless |
416609 | 7.4 | std units | NaN | 7.4 dimensionless |
94631 rows × 4 columns
‘None’ is uninterpretable and replaced with NaN, which then gets replaced with ‘dimensionless’ since pH is unitless
Salinity
[49]:
# Salinity
df = harmonize.harmonize(df, 'Salinity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/basis.py:154: UserWarning: Mismatched ResultTemperatureBasisText: updated from 25 deg C to @25C (units)
warn(f"Mismatched {flag}", UserWarning)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'deg C' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 30755.000000
mean 19.334149
std 27.164539
min 0.000000
25% 0.535000
50% 29.000000
75% 31.100000
max 4003.482834
dtype: float64
Unusable results: 217
Usable results with inferred units: 1
Results outside threshold (0.0 to 182.3213833602183): 2
[50]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
df.loc[df['CharacteristicName']=='Salinity', cols]
[50]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | |
---|---|---|---|---|
21 | 33.3 | ppth | NaN | 33.3 Practical_Salinity_Units |
49 | 32.7 | ppth | NaN | 32.7 Practical_Salinity_Units |
88 | 30.48 | ppt | NaN | 30.48 Practical_Salinity_Units |
96 | 0.21 | ppth | NaN | 0.21 Practical_Salinity_Units |
98 | 12.3 | ppth | NaN | 12.3 Practical_Salinity_Units |
... | ... | ... | ... | ... |
415403 | 9.9 | PSU | NaN | 9.9 Practical_Salinity_Units |
416008 | 4.6 | PSU | NaN | 4.6 Practical_Salinity_Units |
416108 | 21.0 | PSU | NaN | 21.0 Practical_Salinity_Units |
416121 | 18.7 | PSU | NaN | 18.7 Practical_Salinity_Units |
416232 | 10.1 | PSU | NaN | 10.1 Practical_Salinity_Units |
30972 rows × 4 columns
Nitrogen
[51]:
# Nitrogen
df = harmonize.harmonize(df, 'Nitrogen', report=True)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'cm3/g' UNDEFINED UNIT for Nitrogen
warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/construction.py:616: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
data = np.array(data, copy=copy)
-Usable results-
count 2409.000000
mean 6.472302
std 106.930304
min 0.000800
25% 0.074000
50% 0.242000
75% 1.000000
max 2800.000000
dtype: float64
Unusable results: 243
Usable results with inferred units: 0
Results outside threshold (0.0 to 648.0541246058881): 6
[52]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Nitrogen']
df.loc[df['CharacteristicName']=='Nitrogen', cols]
[52]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Nitrogen | |
---|---|---|---|---|
648 | 0.926976 | mg/L | NaN | 0.926976 milligram / liter |
751 | 1.571196 | mg/L | NaN | 1.571196 milligram / liter |
757 | 0.816144 | mg/L | NaN | 0.816144 milligram / liter |
1018 | 0.848832 | mg/L | NaN | 0.848832 milligram / liter |
1063 | 0.770448 | mg/L | NaN | 0.770448 milligram / liter |
... | ... | ... | ... | ... |
416428 | 0.119 | mg/l | NaN | 0.119 milligram / liter |
416486 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
416529 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
416535 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
416572 | NaN | NaN | ResultMeasureValue: missing (NaN) result; Resu... | NaN |
2652 rows × 4 columns
Conductivity
[53]:
# Conductivity
df = harmonize.harmonize(df, 'Conductivity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 2599.000000
mean 1093.016402
std 4249.270869
min 0.000000
25% 280.500000
50% 402.000000
75% 553.000000
max 48600.000000
dtype: float64
Unusable results: 80
Usable results with inferred units: 0
Results outside threshold (0.0 to 26588.641616518235): 29
[54]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Conductivity']
df.loc[df['CharacteristicName']=='Conductivity', cols]
[54]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Conductivity | |
---|---|---|---|---|
18 | 349 | uS/cm | NaN | 349.0 microsiemens / centimeter |
23 | 8 | count | NaN | NaN |
36 | 443 | uS/cm | NaN | 443.0 microsiemens / centimeter |
43 | 67.11703531 | uS/cm | NaN | 67.11703531 microsiemens / centimeter |
67 | 590 | uS/cm | NaN | 590.0 microsiemens / centimeter |
... | ... | ... | ... | ... |
310223 | 0.431 | mS/cm | NaN | 431.0 microsiemens / centimeter |
310246 | 0.321 | mS/cm | NaN | 321.0 microsiemens / centimeter |
310249 | 0.364 | mS/cm | NaN | 364.0 microsiemens / centimeter |
310265 | 0.246 | mS/cm | NaN | 246.0 microsiemens / centimeter |
310266 | 0.607 | mS/cm | NaN | 607.0 microsiemens / centimeter |
2679 rows × 4 columns
Chlorophyll a
[55]:
# Chlorophyll a
df = harmonize.harmonize(df, 'Chlorophyll a', report=True)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/cm2' UNDEFINED UNIT for Chlorophyll
warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ppb' UNDEFINED UNIT for Chlorophyll
warn("WARNING: " + problem)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/wq_data.py:395: UserWarning: WARNING: 'ug/m3' UNDEFINED UNIT for Chlorophyll
warn("WARNING: " + problem)
-Usable results-
count 4156.000000
mean 0.583832
std 3.286460
min -0.002400
25% 0.001900
50% 0.005200
75% 0.019000
max 92.900000
dtype: float64
Unusable results: 202
Usable results with inferred units: 9
Results outside threshold (0.0 to 20.302593411738652): 29
[56]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Chlorophyll']
df.loc[df['CharacteristicName']=='Chlorophyll a', cols]
[56]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Chlorophyll | |
---|---|---|---|---|
2 | 4.0 | mg/m3 | NaN | 0.004000000000000001 milligram / liter |
181 | 5.50 | ug/L | NaN | 0.0055 milligram / liter |
342 | 9.48 | ug/L | NaN | 0.00948 milligram / liter |
584 | 3.0 | mg/m3 | NaN | 0.003000000000000001 milligram / liter |
778 | NaN | mg/m3 | ResultMeasureValue: missing (NaN) result | NaN |
... | ... | ... | ... | ... |
416312 | 0.5 | ug/l | NaN | 0.0005 milligram / liter |
416319 | 2.5 | ug/l | NaN | 0.0025 milligram / liter |
416332 | 0.7 | ug/l | NaN | 0.0007 milligram / liter |
416365 | 1.1 | ug/l | NaN | 0.0011 milligram / liter |
416420 | 0.7 | ug/l | NaN | 0.0007 milligram / liter |
4358 rows × 4 columns
Organic Carbon
[57]:
# Organic carbon (%)
df = harmonize.harmonize(df, 'Organic carbon', report=True)
-Usable results-
count 6126.000000
mean 5300.199057
std 35634.927260
min -90000.000000
25% 1.720000
50% 4.900000
75% 8.900000
max 530000.000000
dtype: float64
Unusable results: 79
Usable results with inferred units: 0
Results outside threshold (0.0 to 219109.7626196926): 40
[58]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Carbon']
df.loc[df['CharacteristicName']=='Organic carbon', cols]
[58]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Carbon | |
---|---|---|---|---|
835 | 7.05 | mg/L | NaN | 7.05 milligram / liter |
945 | 7.57 | mg/L | NaN | 7.57 milligram / liter |
972 | 3.5 | mg/L | NaN | 3.5 milligram / liter |
2545 | 3.8 | mg/L | NaN | 3.8 milligram / liter |
3288 | 3.7 | mg/L | NaN | 3.7 milligram / liter |
... | ... | ... | ... | ... |
416416 | 7.01 | mg/l | NaN | 7.01 milligram / liter |
416485 | 5.96 | mg/l | NaN | 5.96 milligram / liter |
416528 | 4.1 | mg/l | NaN | 4.1 milligram / liter |
416534 | 7.79 | mg/l | NaN | 7.79 milligram / liter |
416571 | 5.79 | mg/l | NaN | 5.79 milligram / liter |
6205 rows × 4 columns
Turbidity (NTU)
[59]:
# Turbidity (NTU)
df = harmonize.harmonize(df, 'Turbidity', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/harmonize.py:149: UserWarning: Bad Turbidity unit: count
warn(f"Bad Turbidity unit: {unit}")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 25833.000000
mean 20.034423
std 82.668605
min -999.000000
25% 1.550000
50% 3.200000
75% 7.100000
max 4100.000000
dtype: float64
Unusable results: 547
Usable results with inferred units: 275
Results outside threshold (0.0 to 516.0460525509235): 168
[60]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Turbidity']
df.loc[df['CharacteristicName']=='Turbidity', cols]
[60]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Turbidity | |
---|---|---|---|---|
1 | 4.2 | NTU | NaN | 4.2 Nephelometric_Turbidity_Units |
5 | 1.3 | NTU | NaN | 1.3 Nephelometric_Turbidity_Units |
10 | 7.9 | NTU | NaN | 7.9 Nephelometric_Turbidity_Units |
20 | 1.5 | NTU | NaN | 1.5 Nephelometric_Turbidity_Units |
25 | 2.1 | NTU | NaN | 2.1 Nephelometric_Turbidity_Units |
... | ... | ... | ... | ... |
416589 | 1.6 | NTRU | NaN | 1.6 Nephelometric_Turbidity_Units |
416599 | 5.8 | NTRU | NaN | 5.8 Nephelometric_Turbidity_Units |
416603 | 2.1 | NTRU | NaN | 2.1 Nephelometric_Turbidity_Units |
416607 | 2.9 | NTRU | NaN | 2.9 Nephelometric_Turbidity_Units |
416611 | 2.7 | NTRU | NaN | 2.7 Nephelometric_Turbidity_Units |
26380 rows × 4 columns
Sediment
[61]:
# Sediment
df = harmonize.harmonize(df, 'Sediment', report=False, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'g / H2O' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
[62]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Sediment']
df.loc[df['CharacteristicName']=='Sediment', cols]
[62]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Sediment | |
---|---|---|---|---|
374162 | 0.012 | g | NaN | NaN |
374163 | 0.0037 | g | NaN | NaN |
374164 | 0.0048 | g | NaN | NaN |
374165 | 0.001 | g | NaN | NaN |
374166 | 0.0088 | g | NaN | NaN |
... | ... | ... | ... | ... |
415678 | 0.0051 | g | NaN | NaN |
415752 | 0.0025 | g | NaN | NaN |
415824 | 0.002 | g | NaN | NaN |
415863 | 0.0023 | g | NaN | NaN |
415909 | 0.0014 | g | NaN | NaN |
4410 rows × 4 columns
Phosphorus
Note: must be merged w/ activities (package runs query by site if not already merged)
[63]:
# Phosphorus
df = harmonize.harmonize(df, 'Phosphorus')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/construction.py:616: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
data = np.array(data, copy=copy)
2 Phosphorus sample fractions not in frac_dict
2 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"
Note: warnings for unexpected characteristic fractions. Fractions are each seperated out into their own result column.
[64]:
# All Phosphorus
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'TDP_Phosphorus']
df.loc[df['Phosphorus'].notna(), cols]
[64]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
27 | 0.024 | mg/L | NaN | NaN |
30 | 0.05 | mg/L | NaN | NaN |
38 | 0.027 | mg/L | NaN | NaN |
46 | 0.059712 | mg/L | NaN | NaN |
93 | 0.049 | mg/L | NaN | NaN |
... | ... | ... | ... | ... |
415885 | 0.006 | mg/l as P | NaN | 0.006 milligram / liter |
415897 | 0.016 | mg/l as P | NaN | NaN |
415898 | 0.008 | mg/l as P | NaN | 0.008 milligram / liter |
415905 | 0.047 | mg/l as P | NaN | NaN |
415906 | 0.028 | mg/l as P | NaN | 0.028 milligram / liter |
15994 rows × 4 columns
[65]:
# Total phosphorus
df.loc[df['TP_Phosphorus'].notna(), cols]
[65]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
27 | 0.024 | mg/L | NaN | NaN |
30 | 0.05 | mg/L | NaN | NaN |
38 | 0.027 | mg/L | NaN | NaN |
46 | 0.059712 | mg/L | NaN | NaN |
93 | 0.049 | mg/L | NaN | NaN |
... | ... | ... | ... | ... |
415859 | 0.036 | mg/l as P | NaN | NaN |
415870 | 0.021 | mg/l as P | NaN | NaN |
415884 | 0.011 | mg/l as P | NaN | NaN |
415897 | 0.016 | mg/l as P | NaN | NaN |
415905 | 0.047 | mg/l as P | NaN | NaN |
10721 rows × 4 columns
[66]:
# Total dissolved phosphorus
df.loc[df['TDP_Phosphorus'].notna(), cols]
[66]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
40115 | 0.023 | ppm | NaN | 0.023000000000000003 milligram / liter |
42563 | 0.035 | ppm | NaN | 0.03500000000000001 milligram / liter |
43712 | 0.017 | ppm | NaN | 0.017000000000000005 milligram / liter |
43817 | 0.015 | ppm | NaN | 0.015000000000000003 milligram / liter |
44687 | 0.015 | ppm | NaN | 0.015000000000000003 milligram / liter |
... | ... | ... | ... | ... |
415860 | 0.015 | mg/l as P | NaN | 0.015 milligram / liter |
415871 | 0.005 | mg/l as P | NaN | 0.005 milligram / liter |
415885 | 0.006 | mg/l as P | NaN | 0.006 milligram / liter |
415898 | 0.008 | mg/l as P | NaN | 0.008 milligram / liter |
415906 | 0.028 | mg/l as P | NaN | 0.028 milligram / liter |
4857 rows × 4 columns
[67]:
# All other phosphorus sample fractions
df.loc[df['Other_Phosphorus'].notna(), cols]
[67]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | TDP_Phosphorus | |
---|---|---|---|---|
29186 | 0.03525375 | mg/L | NaN | NaN |
29294 | 0.107028125 | mg/L | NaN | NaN |
29827 | 0.0697675 | mg/L | NaN | NaN |
30045 | 0.04356 | mg/L | NaN | NaN |
30194 | 0.03654875 | mg/L | NaN | NaN |
... | ... | ... | ... | ... |
405967 | 530.0 | mg/kg | NaN | NaN |
407304 | 280.0 | mg/kg | NaN | NaN |
408078 | 270.0 | mg/kg | NaN | NaN |
408324 | 260.0 | mg/kg | NaN | NaN |
408593 | 1300.0 | mg/kg | NaN | NaN |
416 rows × 4 columns
Bacteria
Some equivalence assumptions are built-in where bacteria counts that are not equivalent are treated as such because there is no standard way to convert from one to another.
Fecal Coliform
[68]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Fecal Coliform', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 2462.000000
mean 910.512591
std 6103.365912
min 0.000000
25% 24.000000
50% 93.000000
75% 400.000000
max 250000.000000
dtype: float64
Unusable results: 5762
Usable results with inferred units: 1
Results outside threshold (0.0 to 37530.70806108734): 7
[69]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Fecal_Coliform']
df.loc[df['CharacteristicName']=='Fecal Coliform', cols]
[69]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Fecal_Coliform | |
---|---|---|---|---|
16318 | 40 | #/100mL | NaN | NaN |
16376 | 9 | #/100mL | NaN | NaN |
16385 | 20 | #/100mL | NaN | NaN |
16533 | 200 | #/100mL | NaN | NaN |
16697 | NO DATA | #/100mL | ResultMeasureValue: "NO DATA" result cannot be... | NaN |
... | ... | ... | ... | ... |
412403 | 2400.0 | cfu/100mL | NaN | NaN |
412405 | 230.0 | cfu/100mL | NaN | NaN |
412409 | 30.0 | cfu/100mL | NaN | NaN |
412413 | NaN | cfu/100mL | ResultMeasureValue: missing (NaN) result | NaN |
413158 | 1.0 | cfu/100ml | NaN | 1.0 Colony_Forming_Units / milliliter |
8224 rows × 4 columns
Excherichia Coli
[70]:
# Known unit with bad dimensionality ('Colony_Forming_Units * milliliter')
df = harmonize.harmonize(df, 'Escherichia coli', report=True, errors='ignore')
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: '%' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'MPN/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'cfu/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'CFU/100mL' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/harmonize_wq/convert.py:128: UserWarning: WARNING: 'count' converted to NaN
warn(f"WARNING: '{unit}' converted to NaN")
-Usable results-
count 1278.000000
mean 626.068286
std 3462.583347
min 0.000000
25% 22.250000
50% 66.000000
75% 220.000000
max 72000.000000
dtype: float64
Unusable results: 22187
Usable results with inferred units: 0
Results outside threshold (0.0 to 21401.56837068052): 7
[71]:
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'E_coli']
df.loc[df['CharacteristicName']=='Escherichia coli', cols]
[71]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | E_coli | |
---|---|---|---|---|
4 | 96 | MPN/100mL | NaN | NaN |
13 | 24200 | MPN/100mL | NaN | NaN |
16 | 110 | MPN/100mL | NaN | NaN |
19 | 52 | MPN/100mL | NaN | NaN |
24 | 52 | MPN/100mL | NaN | NaN |
... | ... | ... | ... | ... |
416585 | 7.0 | MPN/100 ml | NaN | 7.0 Colony_Forming_Units / milliliter |
416598 | 870.0 | MPN/100 ml | NaN | 870.0 Colony_Forming_Units / milliliter |
416602 | 44.0 | MPN/100 ml | NaN | 44.0 Colony_Forming_Units / milliliter |
416606 | 14.0 | MPN/100 ml | NaN | 14.0 Colony_Forming_Units / milliliter |
416610 | 390.0 | MPN/100 ml | NaN | 390.0 Colony_Forming_Units / milliliter |
23465 rows × 4 columns
Combining Salinity and Conductivity
Convert module has various functions to convert from one unit or characteristic to another. Some of these are used within a single characteristic during harmonization (e.g. DO saturation to concentration) while others are intended to model one characteristic as an indicator of another (e.g. estimate salinity from conductivity).
Note: this should only be done after both characteristic fields have been harmonized. Results before and after should be inspected, thresholds for outliers applied, and consider adding a QA_flag for modeled data.
Explore Salinity results:
[72]:
from harmonize_wq import convert
[73]:
# First note initial Salinity info
lst = [x.magnitude for x in list(df['Salinity'].dropna())]
q_sum = sum(lst)
print('Range: {} to {}'.format(min(lst), max(lst)))
print('Results: {} \nMean: {} PSU'.format(len(lst), q_sum/len(lst)))
Range: 0.0 to 4003.4828342857154
Results: 30755
Mean: 19.334148801040765 PSU
[74]:
# Identify extreme outliers
[x for x in lst if x >3200]
[74]:
[4003.4828342857154]
Other fields like units and QA_flag may help understand what caused high values and what results might need to be dropped from consideration
[75]:
# Columns to focus on
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity']
[76]:
# Look at important fields for max 5 values
salinity_series = df['Salinity'][df['Salinity'].notna()]
salinity_series.sort_values(ascending=False, inplace=True)
df[cols][df['Salinity'].isin(salinity_series[0:5])]
[76]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | |
---|---|---|---|---|
22981 | 804 | ppth | NaN | 804.0 Practical_Salinity_Units |
145380 | 71.49 | ppth | NaN | 71.49 Practical_Salinity_Units |
146389 | 70.62 | ppth | NaN | 70.62 Practical_Salinity_Units |
178350 | 77.6666666666667 | ppt | NaN | 77.6666666666667 Practical_Salinity_Units |
382543 | 4980.0 | mg/mL @25C | ResultTemperatureBasisText: updated from 25 de... | 4003.4828342857154 Practical_Salinity_Units |
Detection limits may help understand what caused low values and what results might need to be dropped or updated
[77]:
df = wrangle.add_detection(df, 'Salinity')
cols+=['ResultDetectionConditionText',
'DetectionQuantitationLimitTypeName',
'DetectionQuantitationLimitMeasure/MeasureValue',
'DetectionQuantitationLimitMeasure/MeasureUnitCode']
[78]:
# Look at important fields for min 5 values (often multiple 0.0)
df[cols][df['Salinity'].isin(salinity_series[-5:])]
[78]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | ResultDetectionConditionText | DetectionQuantitationLimitTypeName | DetectionQuantitationLimitMeasure/MeasureValue | DetectionQuantitationLimitMeasure/MeasureUnitCode | |
---|---|---|---|---|---|---|---|---|
10817 | 0 | ppm | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
12039 | 0 | ppm | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
13335 | 0 | ppm | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
19164 | 0 | ppt | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
19615 | 0 | PSS | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
323504 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
323598 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
323652 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
323657 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
323682 | 0.0 | ppth | NaN | 0.0 Practical_Salinity_Units | NaN | NaN | NaN | NaN |
649 rows × 8 columns
Explore conductivity results:
[79]:
# Create series for Conductivity values
cond_series = df['Conductivity'].dropna()
cond_series
[79]:
18 349.0 microsiemens / centimeter
36 443.0 microsiemens / centimeter
43 67.11703531 microsiemens / centimeter
67 590.0 microsiemens / centimeter
109 418.375 microsiemens / centimeter
...
310223 431.0 microsiemens / centimeter
310246 321.0 microsiemens / centimeter
310249 364.0 microsiemens / centimeter
310265 246.0 microsiemens / centimeter
310266 607.0 microsiemens / centimeter
Name: Conductivity, Length: 2599, dtype: object
Conductivity thresholds from Freshwater Explorer: 10 > x < 5000 us/cm, use a higher threshold for coastal waters
[80]:
# Sort and check other relevant columns before converting (e.g. Salinity)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'QA_flag', 'Salinity', 'Conductivity']
df.sort_values(by=['Conductivity'], ascending=False, inplace=True)
df.loc[df['Conductivity'].notna(), cols]
[80]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | QA_flag | Salinity | Conductivity | |
---|---|---|---|---|---|
105835 | 48.6 | mS/cm | NaN | NaN | 48600.0 microsiemens / centimeter |
107360 | 48.15 | mS/cm | NaN | NaN | 48150.0 microsiemens / centimeter |
69499 | 48 | mS/cm | NaN | NaN | 48000.0 microsiemens / centimeter |
74291 | 46.9 | mS/cm | NaN | NaN | 46900.0 microsiemens / centimeter |
106347 | 46.8 | mS/cm | NaN | NaN | 46800.0 microsiemens / centimeter |
... | ... | ... | ... | ... | ... |
4824 | 0.03 | uS/cm | NaN | NaN | 0.03 microsiemens / centimeter |
2738 | 0.02 | uS/cm | NaN | NaN | 0.02 microsiemens / centimeter |
5287 | 0.02 | uS/cm | NaN | NaN | 0.02 microsiemens / centimeter |
3156 | 0 | uS/cm | NaN | NaN | 0.0 microsiemens / centimeter |
6715 | 0 | uS/cm | NaN | NaN | 0.0 microsiemens / centimeter |
2599 rows × 5 columns
[81]:
# Convert values to PSU and write to Salinity
cond_series = cond_series.apply(str) # Convert to string to convert to dimensionless (PSU)
df.loc[df['Conductivity'].notna(), 'Salinity'] = cond_series.apply(convert.conductivity_to_PSU)
df.loc[df['Conductivity'].notna(), 'Salinity']
[81]:
105835 31.712 dimensionless
107360 31.384 dimensionless
69499 31.274 dimensionless
74291 30.474 dimensionless
106347 30.401 dimensionless
...
4824 0.012 dimensionless
2738 0.012 dimensionless
5287 0.012 dimensionless
3156 0.012 dimensionless
6715 0.012 dimensionless
Name: Salinity, Length: 2599, dtype: object
Datetime
datetime() formats time using dataretrieval and ActivityStart
[82]:
# First inspect the existing unformated fields
cols = ['ActivityStartDate', 'ActivityStartTime/Time', 'ActivityStartTime/TimeZoneCode']
df[cols]
[82]:
ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | |
---|---|---|---|
105835 | 2007-08-23 | 09:23:00 | EST |
107360 | 2007-08-09 | 08:14:00 | EST |
69499 | 2006-08-21 | 11:50:00 | EST |
74291 | 2006-08-07 | 10:00:00 | EST |
106347 | 2007-08-09 | 10:20:00 | EST |
... | ... | ... | ... |
416607 | 2024-01-08 | 12:00:00 | EST |
416608 | 2024-02-12 | 08:45:00 | EST |
416609 | 2024-02-12 | 08:45:00 | EST |
416610 | 2024-02-12 | 08:45:00 | EST |
416611 | 2024-02-12 | 08:45:00 | EST |
416612 rows × 3 columns
[83]:
# 'ActivityStartDate' presserves date where 'Activity_datetime' is NAT due to no time zone
df = clean.datetime(df)
df[['ActivityStartDate', 'Activity_datetime']]
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/dataretrieval/utils.py:87: UserWarning: Warning: 132776 incomplete dates found, consider setting datetime_index to False.
warnings.warn(
[83]:
ActivityStartDate | Activity_datetime | |
---|---|---|
105835 | 2007-08-23 | 2007-08-23 14:23:00+00:00 |
107360 | 2007-08-09 | 2007-08-09 13:14:00+00:00 |
69499 | 2006-08-21 | 2006-08-21 16:50:00+00:00 |
74291 | 2006-08-07 | 2006-08-07 15:00:00+00:00 |
106347 | 2007-08-09 | 2007-08-09 15:20:00+00:00 |
... | ... | ... |
416607 | 2024-01-08 | 2024-01-08 17:00:00+00:00 |
416608 | 2024-02-12 | 2024-02-12 13:45:00+00:00 |
416609 | 2024-02-12 | 2024-02-12 13:45:00+00:00 |
416610 | 2024-02-12 | 2024-02-12 13:45:00+00:00 |
416611 | 2024-02-12 | 2024-02-12 13:45:00+00:00 |
416612 rows × 2 columns
Activity_datetime combines all three time component columns into UTC. If time is missing this is NaT so a ActivityStartDate column is used to preserve date only.
Depth
Note: Data are often lacking sample depth metadata
[84]:
# Depth of sample (default units='meter')
df = clean.harmonize_depth(df)
#df.loc[df['ResultDepthHeightMeasure/MeasureValue'].dropna(), "Depth"]
df['ResultDepthHeightMeasure/MeasureValue'].dropna()
[84]:
105835 0.15
107360 0.15
69499 0.15
74291 0.15
106347 0.15
...
282332 0.15
282381 0.15
282388 0.15
282416 0.15
282452 0.15
Name: ResultDepthHeightMeasure/MeasureValue, Length: 2460, dtype: float64
Characteristic to Column (long to wide format)
[85]:
# Split single QA column into multiple by characteristic (rename the result to preserve these QA_flags)
df2 = wrangle.split_col(df)
df2
[85]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | ActivityStartDate | ActivityStartTime/Time | ActivityStartTime/TimeZoneCode | MonitoringLocationIdentifier | ResultIdentifier | DataLoggerLine | ResultDetectionConditionText | ... | QA_Secchi | QA_Temperature | QA_Nitrogen | QA_TP_Phosphorus | QA_TDP_Phosphorus | QA_Other_Phosphorus | QA_Turbidity | QA_Carbon | QA_Conductivity | QA_Chlorophyll | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105835 | WWMD_VA | WQX Test Organization | WWMD_VA-BI1:20070823092300:SO:1.5:WB | 2007-08-23 | 09:23:00 | -0500 | WWMD_VA-BI1 | STORET-1013198907 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
107360 | WWMD_VA | WQX Test Organization | WWMD_VA-BI1:20070809081400:SO:1.7:WB | 2007-08-09 | 08:14:00 | -0500 | WWMD_VA-BI1 | STORET-1013198861 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
69499 | WWMD_VA | WQX Test Organization | WWMD_VA-SR6A:20060821115000:SO:0.8:WB | 2006-08-21 | 11:50:00 | -0500 | WWMD_VA-SR6A | STORET-1013198845 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
74291 | WWMD_VA | WQX Test Organization | WWMD_VA-OB6:20060807100000:SO:1.7:WB | 2006-08-07 | 10:00:00 | -0500 | WWMD_VA-OB6 | STORET-1013198765 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
106347 | WWMD_VA | WQX Test Organization | WWMD_VA-LT1N:20070809102000:SO:0.75:WB | 2007-08-09 | 10:20:00 | -0500 | WWMD_VA-LT1N | STORET-1013198891 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
416607 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400495 | 2024-01-08 | 12:00:00 | -0500 | USGS-01098530 | NWIS-126836828 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
416608 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | -0500 | USGS-01097050 | NWIS-126989046 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
416609 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | -0500 | USGS-01097050 | NWIS-126989056 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
416610 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | -0500 | USGS-01097050 | NWIS-126989063 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
416611 | USGS-MA | USGS Massachusetts Water Science Center | nwisma.01.02400639 | 2024-02-12 | 08:45:00 | -0500 | USGS-01097050 | NWIS-126989064 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
380739 rows × 118 columns
[86]:
# This expands the single col (QA_flag) out to a number of new columns based on the unique characteristicNames and speciation
print('{} new columns'.format(len(df2.columns) - len(df.columns)))
15 new columns
[87]:
# Note: there are fewer rows because NAN results are also dropped in this step
print('{} fewer rows'.format(len(df)-len(df2)))
35873 fewer rows
[88]:
#Examine Carbon flags from earlier in notebook (note these are empty now because NAN is dropped)
cols = ['ResultMeasureValue', 'ResultMeasure/MeasureUnitCode', 'Carbon', 'QA_Carbon']
df2.loc[df2['QA_Carbon'].notna(), cols]
[88]:
ResultMeasureValue | ResultMeasure/MeasureUnitCode | Carbon | QA_Carbon |
---|
Next the table is divided into the columns of interest (main_df) and characteristic specific metadata (chars_df)
[89]:
# split table into main and characteristics tables
main_df, chars_df = wrangle.split_table(df2)
[90]:
# Columns still in main table
main_df.columns
[90]:
Index(['OrganizationIdentifier', 'OrganizationFormalName',
'ActivityIdentifier', 'MonitoringLocationIdentifier', 'ProviderName',
'Secchi', 'Temperature', 'DO', 'pH', 'Salinity', 'Nitrogen',
'Speciation', 'TOTAL NITROGEN_ MIXED FORMS', 'Conductivity',
'Chlorophyll', 'Carbon', 'Turbidity', 'Sediment', 'Phosphorus',
'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Fecal_Coliform',
'E_coli', 'DetectionQuantitationLimitTypeName',
'DetectionQuantitationLimitMeasure/MeasureValue',
'DetectionQuantitationLimitMeasure/MeasureUnitCode',
'Activity_datetime', 'Depth', 'QA_pH', 'QA_E_coli', 'QA_Salinity',
'QA_Fecal_Coliform', 'QA_Sediment', 'QA_DO', 'QA_Secchi',
'QA_Temperature', 'QA_Nitrogen', 'QA_TP_Phosphorus',
'QA_TDP_Phosphorus', 'QA_Other_Phosphorus', 'QA_Turbidity', 'QA_Carbon',
'QA_Conductivity', 'QA_Chlorophyll'],
dtype='object')
[91]:
# look at main table results (first 5)
main_df.head()
[91]:
OrganizationIdentifier | OrganizationFormalName | ActivityIdentifier | MonitoringLocationIdentifier | ProviderName | Secchi | Temperature | DO | pH | Salinity | ... | QA_Secchi | QA_Temperature | QA_Nitrogen | QA_TP_Phosphorus | QA_TDP_Phosphorus | QA_Other_Phosphorus | QA_Turbidity | QA_Carbon | QA_Conductivity | QA_Chlorophyll | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105835 | WWMD_VA | WQX Test Organization | WWMD_VA-BI1:20070823092300:SO:1.5:WB | WWMD_VA-BI1 | STORET | NaN | NaN | NaN | NaN | 31.712 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
107360 | WWMD_VA | WQX Test Organization | WWMD_VA-BI1:20070809081400:SO:1.7:WB | WWMD_VA-BI1 | STORET | NaN | NaN | NaN | NaN | 31.384 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
69499 | WWMD_VA | WQX Test Organization | WWMD_VA-SR6A:20060821115000:SO:0.8:WB | WWMD_VA-SR6A | STORET | NaN | NaN | NaN | NaN | 31.274 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
74291 | WWMD_VA | WQX Test Organization | WWMD_VA-OB6:20060807100000:SO:1.7:WB | WWMD_VA-OB6 | STORET | NaN | NaN | NaN | NaN | 30.474 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
106347 | WWMD_VA | WQX Test Organization | WWMD_VA-LT1N:20070809102000:SO:0.75:WB | WWMD_VA-LT1N | STORET | NaN | NaN | NaN | NaN | 30.401 dimensionless | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 45 columns
[92]:
# Empty columns that could be dropped (Mostly QA columns)
cols = list(main_df.columns)
x = main_df.dropna(axis=1, how='all')
[col for col in cols if col not in x.columns]
[92]:
['Sediment',
'DetectionQuantitationLimitTypeName',
'DetectionQuantitationLimitMeasure/MeasureValue',
'DetectionQuantitationLimitMeasure/MeasureUnitCode',
'QA_E_coli',
'QA_Sediment',
'QA_DO',
'QA_TDP_Phosphorus',
'QA_Carbon',
'QA_Conductivity']
[93]:
# Map average results at each station
gdf_avg = visualize.map_measure(main_df, stations_clipped, 'Temperature')
gdf_avg.plot(column='mean', cmap='OrRd', legend=True)
[93]:
<Axes: >