harmonize_wq package

harmonize_wq.basis module

Functions to process characteristic basis or return basis dictionary.

harmonize_wq.basis.unit_basis_dict

Characteristic specific basis dictionary to define basis from units.

Notes

Dictionary with logic for determining basis from units string and standard pint units to replace those with. The structure is {Basis: {standard units: [unit strings with basis]}}.

The out_col is often derived from WQCharData.char_val. The desired basis can be used as a key to subset result.

Examples

Get dictionary for Phosphorus and subset for ‘as P’:

>>> from harmonize_wq import basis
>>> basis.unit_basis_dict['Phosphorus']['as P']
{'mg/l': ['mg/l as P', 'mg/l P'], 'mg/kg': ['mg/kg as P', 'mg/kg P']}
Type:

dict
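
As a rough illustration of how the {Basis: {standard units: [unit strings with basis]}} structure can be searched, the sketch below looks up a raw unit string and returns the standard unit together with its basis. The helper name `split_basis` is hypothetical and is not part of harmonize_wq; the dictionary literal only copies the ‘as P’ entry shown above.

```python
# Hypothetical helper (not part of harmonize_wq): search a
# {basis: {standard_unit: [unit strings with basis]}} dictionary.
PHOSPHORUS_BASIS = {
    "as P": {
        "mg/l": ["mg/l as P", "mg/l P"],
        "mg/kg": ["mg/kg as P", "mg/kg P"],
    },
}


def split_basis(unit_str, basis_dict):
    """Return (standard_unit, basis), or (unit_str, None) if no match."""
    for basis, unit_map in basis_dict.items():
        for standard_unit, raw_units in unit_map.items():
            if unit_str in raw_units:
                return standard_unit, basis
    return unit_str, None


print(split_basis("mg/kg P", PHOSPHORUS_BASIS))
print(split_basis("mg/l", PHOSPHORUS_BASIS))
```

Unit strings that carry no basis fall through unchanged, which mirrors the idea that only units with an embedded basis need to be rewritten.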

harmonize_wq.basis.basis_conversion

Get dictionary of conversion factors to convert basis/speciation. For example, this is used to convert ‘as PO4’ to ‘as P’. Dictionary structure {basis: conversion factor}.

Type:

dict
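
One way to picture such a conversion factor is as the mass fraction of the target species in the reported species. The sketch below derives an ‘as PO4’ to ‘as P’ factor from standard atomic weights; the weights and the helper name `as_po4_to_as_p` are assumptions for illustration, and the values stored in basis_conversion may differ (for example through rounding).

```python
# Assumed standard atomic weights in g/mol; harmonize_wq may use
# different (e.g. rounded) values in basis_conversion.
ATOMIC_WEIGHT = {"P": 30.973762, "O": 15.999}

# Molecular weight of phosphate (PO4): one P plus four O.
mw_po4 = ATOMIC_WEIGHT["P"] + 4 * ATOMIC_WEIGHT["O"]

# Mass fraction of P in PO4: multiplying an 'as PO4' result by this
# factor expresses it 'as P'.
po4_to_p = ATOMIC_WEIGHT["P"] / mw_po4


def as_po4_to_as_p(value):
    """Convert a concentration reported 'as PO4' to 'as P'."""
    return value * po4_to_p


print(round(po4_to_p, 4))
```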

harmonize_wq.basis.stp_dict

Get standard temperature and pressure to define basis from units. Dictionary structure {‘standard temp’ : {‘units’: [values to replace]}}.

Notes

This needs to be updated to include pressure or needs to be renamed.

Type:

dict

harmonize_wq.basis.basis_from_method_spec(df_in)

Copy speciation from MethodSpecificationName to new ‘Speciation’ column.

Parameters:

df_in (pandas.DataFrame) – DataFrame that will be updated.

Returns:

df – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'MethodSpecificationName': ['as P', nan],
...                 'ProviderName': ['NWIS', 'NWIS',],
...                 })
>>> df
  CharacteristicName MethodSpecificationName ProviderName
0         Phosphorus                    as P         NWIS
1         Phosphorus                     NaN         NWIS
>>> from harmonize_wq import basis
>>> basis.basis_from_method_spec(df)
  CharacteristicName MethodSpecificationName ProviderName Speciation
0         Phosphorus                    as P         NWIS       as P
1         Phosphorus                     NaN         NWIS        NaN
harmonize_wq.basis.basis_from_unit(df_in, basis_dict, unit_col='Units', basis_col='Speciation')

Move basis from units to basis column in pandas.DataFrame.

Move basis information from units in unit_col column to basis in basis_col column based on basis_dict. If basis_col does not exist in df_in, it will be created. The unit_col column is updated in place. To maintain data integrity, unit_col should not be the original ‘ResultMeasure/MeasureUnitCode’ column.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • basis_dict (dict) – Dictionary with structure {basis:{new_unit:[old_units]}}.

  • unit_col (str, optional) – String for the units column name in df_in to use. The default is ‘Units’.

  • basis_col (str, optional) – String for the basis column name in df_in to use. The default is ‘Speciation’.

Returns:

df – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/l as P', 'mg/kg as P'],
...                 'Units':  ['mg/l as P', 'mg/kg as P'],
...                 })
>>> df
  CharacteristicName ResultMeasure/MeasureUnitCode       Units
0         Phosphorus                     mg/l as P   mg/l as P
1         Phosphorus                    mg/kg as P  mg/kg as P
>>> from harmonize_wq import basis
>>> basis_dict = basis.unit_basis_dict['Phosphorus']
>>> unit_col = 'Units'
>>> basis.basis_from_unit(df, basis_dict, unit_col)
  CharacteristicName ResultMeasure/MeasureUnitCode  Units Speciation
0         Phosphorus                     mg/l as P   mg/l       as P
1         Phosphorus                    mg/kg as P  mg/kg       as P

If an existing basis_col value is different, a warning is issued when it is updated and a QA_flag is assigned:

>>> from numpy import nan
>>> df['Speciation'] = [nan, 'as PO4']
>>> df_speciation_change = basis.basis_from_unit(df, basis_dict, unit_col)
... 
UserWarning: Mismatched Speciation: updated from as PO4 to as P (units)
>>> df_speciation_change[['Speciation', 'QA_flag']]
  Speciation                                          QA_flag
0       as P                                              NaN
1       as P  Speciation: updated from as PO4 to as P (units)
harmonize_wq.basis.set_basis(df_in, mask, basis, basis_col='Speciation')

Update or create basis_col with basis as value.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • mask (pandas.Series) – Row conditional mask to limit rows (e.g. to specific unit/speciation).

  • basis (str) – The string to use for basis.

  • basis_col (str, optional) – The new or existing column for basis string. The default is ‘Speciation’.

Returns:

df_out – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus',
...                                        'Phosphorus',
...                                        'Salinity'],
...                 'MethodSpecificationName': ['as P', 'as PO4', ''],
...                 })
>>> df  
  CharacteristicName MethodSpecificationName
0         Phosphorus                    as P
1         Phosphorus                  as PO4
2           Salinity

Build mask for example:

>>> mask = df['CharacteristicName']=='Phosphorus'
>>> from harmonize_wq import basis
>>> basis.set_basis(df, mask, basis='as P')
  CharacteristicName MethodSpecificationName Speciation
0         Phosphorus                    as P       as P
1         Phosphorus                  as PO4       as P
2           Salinity                                NaN
harmonize_wq.basis.update_result_basis(df_in, basis_col, unit_col)

Move basis from unit_col column to basis_col column.

This is usually used in place of basis_from_unit when the basis_col is not ‘ResultMeasure/MeasureUnitCode’ (i.e., not speciation).

Notes

Currently overwrites the original basis_col values rather than creating many new empty columns. The original values are noted in the QA_flag.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • basis_col (str) – Column in df_in with result basis to update. Expected values are ‘ResultTemperatureBasisText’.

  • unit_col (str) – Column in df_in with units that may contain basis.

Returns:

df_out – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'CharacteristicName': ['Salinity', 'Salinity',],
...                 'ResultTemperatureBasisText': ['25 deg C', nan,],
...                 'Units':  ['mg/mL @25C', 'mg/mL @25C'],
...                 })
>>> df
  CharacteristicName ResultTemperatureBasisText       Units
0           Salinity                   25 deg C  mg/mL @25C
1           Salinity                        NaN  mg/mL @25C
>>> from harmonize_wq import basis
>>> df_temp_basis = basis.update_result_basis(df,
...                                           'ResultTemperatureBasisText',
...                                           'Units')
... 
UserWarning: Mismatched ResultTemperatureBasisText: updated from 25 deg C to @25C
(units)
>>> df_temp_basis[['Units']]
   Units
0  mg/mL
1  mg/mL
>>> df_temp_basis[['ResultTemperatureBasisText', 'QA_flag']]
  ResultTemperatureBasisText                                            QA_flag
0                       @25C  ResultTemperatureBasisText: updated from 25 de...
1                       @25C                                                NaN

harmonize_wq.clean module

Functions to clean/correct additional columns in subset/entire dataset.

harmonize_wq.clean.add_qa_flag(df_in, mask, flag)

Add flag to ‘QA_flag’ column in df_in.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • mask (pandas.Series) – Row conditional mask to limit rows.

  • flag (str) – Text to populate the new flag with.

Returns:

df_out – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Carbon', 'Phosphorus', 'Carbon',],
...                 'ResultMeasureValue': ['1.0', '0.265', '2.1'],})
>>> df
  CharacteristicName ResultMeasureValue
0             Carbon                1.0
1         Phosphorus              0.265
2             Carbon                2.1

Assign simple flag string and mask to assign flag only to Carbon:

>>> flag = 'words'
>>> mask = df['CharacteristicName']=='Carbon'
>>> from harmonize_wq import clean
>>> clean.add_qa_flag(df, mask, flag)
  CharacteristicName ResultMeasureValue QA_flag
0             Carbon                1.0   words
1         Phosphorus              0.265     NaN
2             Carbon                2.1   words
harmonize_wq.clean.check_precision(df_in, col, limit=3)

Add QA_flag if value in column has precision lower than limit.

Notes

Be cautious of float type and real vs representable precision.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with the column (col) to check.

  • col (str) – Desired column in df_in.

  • limit (int, optional) – Number of decimal places under which to detect. The default is 3.

Returns:

df_out – DataFrame with the quality assurance flag for precision.

Return type:

pandas.DataFrame
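
The caution about float type can be made concrete with a small standard-library sketch. It counts decimal places from the text form of a value and compares the count against a limit; the helper names are hypothetical, only finite values are handled, and this is not the check_precision implementation.

```python
from decimal import Decimal


def decimal_places(value):
    """Count digits after the decimal point in the text form of value."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)


def low_precision(value, limit=3):
    """True when value has fewer decimal places than limit."""
    return decimal_places(value) < limit


print(decimal_places("0.265"), low_precision("0.265"))
print(decimal_places("1.0"), low_precision("1.0"))
# Float arithmetic can add digits that were never measured.
print(decimal_places(0.1 + 0.2))
```

Keeping results as strings until precision has been checked avoids the last case, where the representable float carries more digits than the reported measurement.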

harmonize_wq.clean.datetime(df_in)

Format time using dataretrieval and ‘ActivityStart’ columns.

Parameters:

df_in (pandas.DataFrame) – DataFrame with the expected activity date, time and timezone columns.

Returns:

df_out – DataFrame with the converted datetime column.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'ActivityStartDate': ['2004-09-01', '2004-07-01',],
...                 'ActivityStartTime/Time': ['10:01:00', nan,],
...                 'ActivityStartTime/TimeZoneCode':  ['EST', nan],
...                 })
>>> df
  ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode
0        2004-09-01               10:01:00                            EST
1        2004-07-01                    NaN                            NaN
>>> from harmonize_wq import clean
>>> clean.datetime(df)
  ActivityStartDate  ...         Activity_datetime
0        2004-09-01  ... 2004-09-01 15:01:00+00:00
1        2004-07-01  ...                       NaT

[2 rows x 4 columns]
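
The conversion in the first row (10:01 EST becoming 15:01 UTC) can be reproduced with the standard library alone. This is a sketch under stated assumptions: the offset table and the helper name `to_utc` are hypothetical, whereas harmonize_wq relies on dataretrieval for the actual time zone handling.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offset table for illustration; harmonize_wq relies on
# dataretrieval for the real time zone handling.
UTC_OFFSET_HOURS = {"EST": -5, "CST": -6, "MST": -7, "PST": -8}


def to_utc(date_str, time_str, tz_code):
    """Combine date, time and zone code into a UTC datetime, or None."""
    if not (date_str and time_str and tz_code in UTC_OFFSET_HOURS):
        return None
    local_zone = timezone(timedelta(hours=UTC_OFFSET_HOURS[tz_code]))
    naive = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


print(to_utc("2004-09-01", "10:01:00", "EST"))
print(to_utc("2004-07-01", None, None))
```

Rows with a missing time or time zone yield no timestamp here, which parallels the NaT in the second row of the example.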
harmonize_wq.clean.df_checks(df_in, columns=None)

Check pandas.DataFrame for columns.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be checked.

  • columns (list, optional) – List of strings for column names. Default None, uses: ‘ResultMeasure/MeasureUnitCode’, ‘ResultMeasureValue’, ‘CharacteristicName’.

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus'],})
>>> df
  CharacteristicName
0         Phosphorus

Check for existing column:

>>> from harmonize_wq import clean
>>> clean.df_checks(df, columns=['CharacteristicName'])

If column is not in DataFrame it throws an AssertionError:

>>> clean.df_checks(df, columns=['ResultMeasureValue'])
Traceback (most recent call last):
    ...
AssertionError: ResultMeasureValue not in DataFrame
harmonize_wq.clean.harmonize_depth(df_in, units='meter')

Create ‘Depth’ column with result depth values in consistent units.

New column combines values from the ‘ResultDepthHeightMeasure/MeasureValue’ column with units from the ‘ResultDepthHeightMeasure/MeasureUnitCode’ column.

Notes

Currently unit registry (ureg) updates or errors are not passed back. In the future, activity depth columns may be considered if result depth is missing.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with the required ‘ResultDepthHeight’ columns.

  • units (str, optional) – Desired units. The default is ‘meter’.

Returns:

df_out – DataFrame with new ‘Depth’ column that combines the ‘ResultDepthHeight’ value and unit columns.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame for example:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'ResultDepthHeightMeasure/MeasureValue': ['3.0', nan, 10],
...                 'ResultDepthHeightMeasure/MeasureUnitCode': ['m', nan, 'ft'],
...                 })
>>> df
  ResultDepthHeightMeasure/MeasureValue ResultDepthHeightMeasure/MeasureUnitCode
0                                   3.0                                        m
1                                   NaN                                      NaN
2                                    10                                       ft

Get clean ‘Depth’ column:

>>> from harmonize_wq import clean
>>> clean.harmonize_depth(df)[['ResultDepthHeightMeasure/MeasureValue',
...                            'Depth']]
  ResultDepthHeightMeasure/MeasureValue                     Depth
0                                   3.0                 3.0 meter
1                                   NaN                       NaN
2                                    10  3.0479999999999996 meter
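
The arithmetic behind the ‘Depth’ values can be sketched without pint. The factor table and the helper name `depth_in_meters` are assumptions for illustration; harmonize_wq performs the conversion through a pint unit registry and returns unit-aware quantities rather than bare floats.

```python
# Assumed length factors (meters per unit); harmonize_wq uses a pint
# unit registry instead of a hard-coded table.
METERS_PER_UNIT = {"m": 1.0, "ft": 0.3048, "cm": 0.01}


def depth_in_meters(value, unit):
    """Return depth in meters, or None when value or unit is missing."""
    if value is None or unit not in METERS_PER_UNIT:
        return None
    return float(value) * METERS_PER_UNIT[unit]


for value, unit in [("3.0", "m"), (None, None), (10, "ft")]:
    print(depth_in_meters(value, unit))
```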
harmonize_wq.clean.methods_check(df_in, char_val, methods=None)

Check methods against list of accepted methods.

Notes

This is not fully implemented.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • char_val (str) – Characteristic name.

  • methods (dict, optional) – Dictionary where key is characteristic column name and value is list of dictionaries each with Source and Method keys. This allows updated methods dictionaries to be used. The default None uses the built-in domains.accepted_methods().

Returns:

accept – List of values from ‘ResultAnalyticalMethod/MethodIdentifier’ column in methods.

Return type:

list

harmonize_wq.clean.wet_dry_checks(df_in, mask=None)

Fix suspected errors in ‘ActivityMediaName’ column.

Uses the ‘ResultWeightBasisText’ and ‘ResultSampleFractionText’ columns to switch if the media is wet/dry where appropriate.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • mask (pandas.Series, optional) – Row conditional (bool) mask to limit df rows to check/fix. The default is None.

Returns:

df_out – Updated DataFrame.

Return type:

pandas.DataFrame

harmonize_wq.clean.wet_dry_drop(df_in, wet_dry='wet', char_val=None)

Restrict to only water or only sediment samples.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • wet_dry (str, optional) – Which values (Water/Sediment) to keep. The default is ‘wet’ (Water).

  • char_val (str, optional) – Apply to specific characteristic name. The default is None (for all).

Returns:

df2 – Updated copy of df_in.

Return type:

pandas.DataFrame

harmonize_wq.convert module

Functions to convert from one unit to another, at times using pint decorators.

Contains several unit conversion functions not in pint.

harmonize_wq.convert.DO_concentration(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)

Convert Dissolved Oxygen (DO) from concentration (mg/l) to saturation (%).

Parameters:
  • val (pint.Quantity.build_quantity_class) – The DO value (converted to mg/L).

  • pressure (pint.Quantity, optional) – The pressure value. The default is 1*ureg(“atm”).

  • temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).

Returns:

Dissolved Oxygen (DO) as saturation (dimensionless).

Return type:

float

Examples

Build units aware pint Quantity, as string:

>>> input_DO = '578 mg/l'
>>> from harmonize_wq import convert
>>> convert.DO_concentration(input_DO)
6995.603308586222
harmonize_wq.convert.DO_saturation(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)

Convert Dissolved Oxygen (DO) from saturation (%) to concentration (mg/l).

Defaults assume STP where pressure is 1 atmosphere and temperature 25C.

Parameters:
  • val (pint.Quantity.build_quantity_class) – The DO saturation value in dimensionless percent.

  • pressure (pint.Quantity, optional) – The pressure value. The default is 1*ureg(“atm”).

  • temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).

Returns:

DO value in mg/l.

Return type:

pint.Quantity

Examples

>>> from harmonize_wq import convert
>>> convert.DO_saturation(70)
<Quantity(5.78363269, 'milligram / liter')>

At 2 atm (10m depth):

>>> convert.DO_saturation(70, ('2 standard_atmosphere'))
11.746159340060716 milligram / liter

harmonize_wq.convert.FNU_to_NTU(val)

Convert turbidity units from FNU (Formazin Nephelometric Units) to NTU.

Parameters:

val (float) – The turbidity magnitude (FNU is dimensionless).

Returns:

NTU – The turbidity magnitude (NTU is dimensionless).

Return type:

float

Examples

Convert to NTU:

>>> from harmonize_wq import convert
>>> convert.FNU_to_NTU(8)
10.136
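
The example output implies a single multiplicative factor (10.136 / 8 = 1.267). The sketch below applies that inferred factor; the constant name and the helper `fnu_to_ntu` are hypothetical, and the factor should be treated as an assumption drawn from the example rather than a documented constant.

```python
# Factor inferred from the example above (10.136 / 8); an assumption,
# not a documented library constant.
NTU_PER_FNU = 1.267


def fnu_to_ntu(val):
    """Scale a dimensionless FNU magnitude to NTU."""
    return val * NTU_PER_FNU


print(round(fnu_to_ntu(8), 3))
```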
harmonize_wq.convert.JTU_to_NTU(val)

Convert turbidity units from JTU (Jackson Turbidity Units) to NTU.

Notes

This is based on a linear relationship (JTU -> NTU): 1 -> 19, 0.053 -> 1, 0.4 -> 7.5

Parameters:

val (pint.Quantity) – The turbidity value in JTU (dimensionless).

Returns:

NTU – The turbidity value in dimensionless NTU.

Return type:

pint.Quantity

Examples

JTU is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method):

>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Turbidity'):
...     ureg.define(definition)

Build JTU units aware pint Quantity:

>>> turbidity = ureg.Quantity('JTU')
>>> str(turbidity)
'1 Jackson_Turbidity_Units'
>>> type(turbidity)
<class 'pint.Quantity'>

Convert to NTU:

>>> from harmonize_wq import convert
>>> str(convert.JTU_to_NTU(str(turbidity)))
'18.9773 Nephelometric_Turbidity_Units'
>>> type(convert.JTU_to_NTU(str(turbidity)))
<class 'pint.Quantity'>
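
To see how a linear relationship can follow from the three pairs listed in the Notes, the sketch below fits an ordinary least-squares line through them and evaluates it at 1 JTU. Whether the library derives its coefficients this way is an assumption; the helper names are hypothetical, and the fitted result only approximates the 18.9773 shown above.

```python
# (JTU, NTU) pairs taken from the Notes above.
POINTS = [(1.0, 19.0), (0.053, 1.0), (0.4, 7.5)]


def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(POINTS)


def jtu_to_ntu(val):
    """Apply the fitted line to a JTU magnitude."""
    return slope * val + intercept


print(round(slope, 3), round(intercept, 4))
print(round(jtu_to_ntu(1), 2))
```

The same fitting helper could be pointed at other sets of reference pairs, such as those listed for SiO2, without changing the code.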
harmonize_wq.convert.NTU_to_cm(val)

Convert turbidity in NTU (Nephelometric Turbidity Units) to centimeters.

Parameters:

val (pint.Quantity) – The turbidity value in NTU.

Returns:

The turbidity value in centimeters.

Return type:

pint.Quantity

Examples

NTU is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method):

>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Turbidity'):
...     ureg.define(definition)

Build NTU units aware pint Quantity:

>>> turbidity = ureg.Quantity('NTU')
>>> str(turbidity)
'1 Nephelometric_Turbidity_Units'
>>> type(turbidity)
<class 'pint.Quantity'>

Convert to cm:

>>> from harmonize_wq import convert
>>> str(convert.NTU_to_cm('1 NTU'))
'241.27 centimeter'
>>> type(convert.NTU_to_cm('1 NTU'))
<class 'pint.Quantity'>
harmonize_wq.convert.PSU_to_density(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)

Convert salinity as Practical Salinity Units (PSU) to density.

Dimensionality changes from dimensionless Practical Salinity Units (PSU) to mass/volume density.

Parameters:
  • val (pint.Quantity) – The salinity value in dimensionless PSU.

  • pressure (pint.Quantity, optional) – The pressure value. The default is 1*ureg(“atm”).

  • temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).

Returns:

density – The salinity value in density units (mg/ml).

Return type:

pint.Quantity.build_quantity_class

Examples

PSU is not a standard pint unit and must be added to a unit registry first. This can be done using the WQCharData.update_ureg method:

>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Salinity'):
...     ureg.define(definition)

Build units aware pint Quantity, as string because it is an altered unit registry:

>>> unit = ureg.Quantity('PSU')
>>> unit
<Quantity(1, 'Practical_Salinity_Units')>
>>> type(unit)
<class 'pint.Quantity'>
>>> input_psu = str(8*unit)
>>> input_psu
'8 Practical_Salinity_Units'

Convert to density:

>>> from harmonize_wq import convert
>>> str(convert.PSU_to_density(input_psu))
'997.0540284772519 milligram / milliliter'
harmonize_wq.convert.SiO2_to_NTU(val)

Convert turbidity units from SiO2 (silicon dioxide) to NTU.

Notes

This is based on a linear relationship: 0.13 -> 1, 1 -> 7.5, 2.5 -> 19

Parameters:

val (pint.Quantity.build_quantity_class) – The turbidity value in SiO2 units (dimensionless).

Returns:

NTU – The turbidity value in dimensionless NTU.

Return type:

pint.Quantity.build_quantity_class

Examples

SiO2 is not a standard pint unit and must be added to a unit registry first (normally done using WQCharData.update_ureg() method):

>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Turbidity'):
...     ureg.define(definition)

Build SiO2 units aware pint Quantity:

>>> turbidity = ureg.Quantity('SiO2')
>>> str(turbidity)
'1 SiO2'
>>> type(turbidity)
<class 'pint.Quantity'>

Convert to NTU:

>>> from harmonize_wq import convert
>>> str(convert.SiO2_to_NTU(str(turbidity)))
'7.5701 Nephelometric_Turbidity_Units'
>>> type(convert.SiO2_to_NTU(str(turbidity)))
<class 'pint.Quantity'>
harmonize_wq.convert.cm_to_NTU(val)

Convert turbidity measured in centimeters to NTU.

Parameters:

val (pint.Quantity) – The turbidity value in centimeters.

Returns:

The turbidity value in NTU.

Return type:

pint.Quantity

Examples

Build standard pint unit registry:

>>> import pint
>>> ureg = pint.UnitRegistry()

Build cm units aware pint Quantity (already in standard unit registry):

>>> turbidity = ureg.Quantity('cm')
>>> str(turbidity)
'1 centimeter'
>>> type(turbidity)
<class 'pint.Quantity'>

Convert to NTU:

>>> from harmonize_wq import convert
>>> str(convert.cm_to_NTU(str(turbidity)))
'3941.8 Nephelometric_Turbidity_Units'
>>> type(convert.cm_to_NTU(str(turbidity)))
<class 'pint.Quantity'>
harmonize_wq.convert.conductivity_to_PSU(val, pressure=<Quantity(0, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)

Estimate salinity (PSU) from conductivity.

Parameters:
  • val (pint.Quantity.build_quantity_class) – The conductivity value (converted to microsiemens / centimeter).

  • pressure (pint.Quantity, optional) – The pressure value. The default is 0*ureg(“atm”).

  • temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).

Returns:

Estimated salinity (PSU).

Return type:

pint.Quantity

Notes

Conductivity to salinity conversion using the PSS 1978 method. Arguments in the referenced method: c, numeric conductivity in uS (microsiemens); t, numeric Celsius temperature (defaults to 25); P, numeric optional pressure (defaults to 0).

References

IOC, SCOR and IAPSO, 2010: The international thermodynamic equation of seawater – 2010: Calculation and use of thermodynamic properties. Intergovernmental Oceanographic Commission, Manuals and Guides No. 56, UNESCO (English), 196 pp.

Alan D. Jassby and James E. Cloern (2015). wq: Some tools for exploring water quality monitoring data. R package v0.4.4. See the ec2pss function.

Adapted from R cond2sal_shiny

Examples

PSU (Practical Salinity Units) is not a standard pint unit and must be added to a unit registry first:

>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Salinity'):
...     ureg.define(definition)

Build units aware pint Quantity, as string:

>>> input_conductivity = '111.0 uS/cm'

Convert to Practical Salinity Units:

>>> from harmonize_wq import convert
>>> convert.conductivity_to_PSU(input_conductivity)
<Quantity(0.057, 'dimensionless')>
harmonize_wq.convert.convert_unit_series(quantity_series, unit_series, units, ureg=None, errors='raise')

Convert quantities to consistent units.

Convert series of quantities (quantity_series), each with a specified old unit (unit_series), to a quantity in units using the pint constructor method.

Parameters:
  • quantity_series (pandas.Series) – List of quantities. Values should be numeric, must not include NaN.

  • unit_series (pandas.Series) – List of units for each quantity in quantity_series. Values should be string, must not include NaN.

  • units (str) – Desired units.

  • ureg (pint.UnitRegistry, optional) – Unit Registry Object with any custom units defined. The default is None.

  • errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will return NaN.

Returns:

Converted values from quantity_series in units with original index.

Return type:

pandas.Series

Examples

Build series to use as input:

>>> from pandas import Series
>>> quantity_series = Series([1, 10])
>>> unit_series = Series(['mg/l', 'mg/ml',])

Convert series to series of pint Quantity objects in ‘mg/l’:

>>> from harmonize_wq import convert
>>> convert.convert_unit_series(quantity_series, unit_series, units = 'mg/l')
0                   1.0 milligram / liter
1    10000.000000000002 milligram / liter
dtype: object
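
The three errors modes are easier to compare side by side. The following sketch mimics them with plain lists and a hypothetical factor table; the function name `convert_to_mg_per_l` and the table are assumptions, and the real function resolves units through pint and returns a pandas.Series of Quantity objects.

```python
import math

# Hypothetical factor table (multiplier to reach mg/l); the library
# resolves units through pint instead.
TO_MG_PER_L = {"mg/l": 1.0, "mg/ml": 1000.0, "ug/l": 0.001}


def convert_to_mg_per_l(quantities, units, errors="raise"):
    """Convert paired quantities/units to mg/l, mimicking errors modes."""
    converted = []
    for quantity, unit in zip(quantities, units):
        if unit in TO_MG_PER_L:
            converted.append(quantity * TO_MG_PER_L[unit])
        elif errors == "skip":
            converted.append(quantity)  # leave the value unconverted
        elif errors == "ignore":
            converted.append(math.nan)  # replace the value with NaN
        else:
            raise ValueError(f"cannot convert {unit!r} to mg/l")
    return converted


print(convert_to_mg_per_l([1, 10], ["mg/l", "mg/ml"]))
print(convert_to_mg_per_l([5], ["deg C"], errors="skip"))
```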
harmonize_wq.convert.density_to_PSU(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)

Convert salinity as density (mass/volume) to Practical Salinity Units.

Parameters:
  • val (pint.Quantity.build_quantity_class) – The salinity value in density units.

  • pressure (pint.Quantity.build_quantity_class, optional) – The pressure value. The default is 1*ureg(“atm”).

  • temperature (pint.Quantity.build_quantity_class, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).

Returns:

PSU – The salinity value in dimensionless PSU.

Return type:

pint.Quantity.build_quantity_class

Examples

PSU (Practical Salinity Units) is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method):

>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Salinity'):
...     ureg.define(definition)

Build units aware pint Quantity, as string:

>>> input_density = '1000 milligram / milliliter'

Convert to Practical Salinity Units:

>>> from harmonize_wq import convert
>>> convert.density_to_PSU(input_density)
<Quantity(4.71542857, 'gram / kilogram')>
harmonize_wq.convert.mass_to_moles(ureg, char_val, Q_)

Convert a mass to moles substance.

Parameters:
  • ureg (pint.UnitRegistry) – Unit Registry Object with any custom units defined.

  • char_val (str) – Characteristic name to use to find corresponding molecular weight.

  • Q_ (pint.Quantity) – Mass to convert to moles.

Returns:

Value in moles of substance.

Return type:

pint.Quantity

Examples

Build standard pint unit registry:

>>> import pint
>>> ureg = pint.UnitRegistry()

Build pint quantity:

>>> Q_ = 1 * ureg('g')
>>> from harmonize_wq import convert
>>> str(convert.mass_to_moles(ureg, 'Phosphorus', Q_))
'0.03228931223764934 mole'
harmonize_wq.convert.moles_to_mass(ureg, Q_, basis=None, char_val=None)

Convert moles substance to mass.

Either basis or char_val must have a non-default value.

Parameters:
  • ureg (pint.UnitRegistry) – Unit Registry Object with any custom units defined.

  • Q_ (ureg.Quantity) – Quantity (measure and units).

  • basis (str, optional) – Speciation (basis) of measure to determine molecular weight. Default is None.

  • char_val (str, optional) – Characteristic Name to use when converting moles substance to mass. Default is None.

Returns:

Value in mass (g).

Return type:

pint.Quantity

Examples

Build standard pint unit registry:

>>> import pint
>>> ureg = pint.UnitRegistry()

Build quantity:

>>> Q_ = 0.265 * ureg('umol')
>>> from harmonize_wq import convert
>>> str(convert.moles_to_mass(ureg, Q_, basis='as P'))
'8.20705e-06 gram'
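
Both mass_to_moles and moles_to_mass reduce to dividing or multiplying by a molecular weight. The sketch below uses 30.97 g/mol for phosphorus, a value inferred from the example outputs (8.20705e-06 g / 0.265e-06 mol) and therefore an assumption; the function names are hypothetical and work on bare floats rather than pint quantities.

```python
# Molecular weight in g/mol, inferred from the examples above
# (8.20705e-06 g / 0.265e-06 mol); an assumption, not a documented value.
MOLECULAR_WEIGHT = {"Phosphorus": 30.97}


def grams_to_moles(grams, char_val):
    """Divide a mass in grams by the molecular weight for char_val."""
    return grams / MOLECULAR_WEIGHT[char_val]


def moles_to_grams(moles, char_val):
    """Multiply an amount in moles by the molecular weight for char_val."""
    return moles * MOLECULAR_WEIGHT[char_val]


print(grams_to_moles(1, "Phosphorus"))
print(moles_to_grams(0.265e-6, "Phosphorus"))
```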

harmonize_wq.domains module

Functions to return domain lists with all potential values.

These are mainly for use as filters. Small or frequently utilized domains may be hard-coded. A URL based method can be used to get the most up to date domain list.

harmonize_wq.domains.accepted_methods

Get accepted methods for each characteristic. Dictionary where key is characteristic column name and value is list of dictionaries each with Source and Method keys.

Notes

Source should be in ‘ResultAnalyticalMethod/MethodIdentifierContext’ column. This is not fully implemented.

Type:

dict

harmonize_wq.domains.stations_rename

Get shortened column names for shapefile (.shp) fields.

Dictionary where key = WQP field name and value = short name for .shp.

ESRI places a length restriction on shapefile (.shp) field names. This returns a dictionary with the original water quality portal field name (as key) and shortened column name for writing as .shp. We suggest using the longer original name as the field alias when writing as .shp.

Examples

Although the full dictionary contains all Key:Value pairs, here we show how the current name can be used as a key to get the new name:

>>> from harmonize_wq import domains
>>> domains.stations_rename['OrganizationIdentifier']
'org_ID'
Type:

dict

harmonize_wq.domains.xy_datum

Get dictionary of expected horizontal datums, where exhaustive:

{HorizontalCoordinateReferenceSystemDatumName: {Description:str, EPSG:int}}

The structure is {key as expected string: value as {“Description”: string, “EPSG”: integer (4-digit code)}}.

Notes

source WQP: HorizontalCoordinateReferenceSystemDatum_CSV.zip

Anything not in dict will be NaN, and non-integer EPSG will be missing: “OTHER”: {“Description”: ‘Other’, “EPSG”: nan}, “UNKWN”: {“Description”: ‘Unknown’, “EPSG”: nan}

Examples

The full dictionary has the structure {abbreviation: {‘Description’: values, ‘EPSG’: values}}. The abbreviation key can be used to get the EPSG code:

>>> from harmonize_wq import domains
>>> domains.xy_datum['NAD83']
{'Description': 'North American Datum 1983', 'EPSG': 4269}
>>> domains.xy_datum['NAD83']['EPSG']
4269
Type:

dict
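
Because datums outside the dictionary, and entries such as “OTHER” and “UNKWN”, carry no usable EPSG code, a lookup needs a NaN fallback. The sketch below shows one way to do that with a subset of the entries quoted in this section; the helper name `epsg_for` and the local dictionary are assumptions for illustration.

```python
import math

# Subset of entries quoted in this section; the real dictionary is longer.
XY_DATUM = {
    "NAD83": {"Description": "North American Datum 1983", "EPSG": 4269},
    "OTHER": {"Description": "Other", "EPSG": math.nan},
    "UNKWN": {"Description": "Unknown", "EPSG": math.nan},
}


def epsg_for(datum_name):
    """Return the EPSG code for a datum name, or NaN when unavailable."""
    return XY_DATUM.get(datum_name, {}).get("EPSG", math.nan)


print(epsg_for("NAD83"))
print(epsg_for("not a datum"))
```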

harmonize_wq.domains.char_tbl_TADA(df, char)

Get structured dictionary for TADA.CharacteristicName from TADA df.

Parameters:
  • df (pandas.DataFrame) – Table from TADA for specific characteristic.

  • char (str) – CharacteristicName.

Returns:

new_char_dict

Returned dictionary follows general structure:

{
    "Target.TADA.CharacteristicName": {
        "Target.TADA.ResultSampleFractionText": [
            "Target.TADA.ResultSampleFractionText"
        ]
    }
}

Return type:

dict

harmonize_wq.domains.characteristic_cols(category=None)

Get characteristic specific columns list, can subset those by category.

Parameters:

category (str, optional) – Subset results: ‘Basis’, ‘Bio’, ‘Depth’, ‘QA’, ‘activity’, ‘analysis’, ‘depth’, ‘measure’, ‘sample’. The default is None.

Returns:

col_list – List of columns.

Return type:

list

Examples

Running the function without a category returns the full list of column names; including a category returns only the columns in that category:

>>> from harmonize_wq import domains
>>> domains.characteristic_cols('QA')
['ResultDetectionConditionText', 'ResultStatusIdentifier', 'PrecisionValue',
 'DataQuality/BiasValue', 'ConfidenceIntervalValue', 'UpperConfidenceLimitValue',
 'LowerConfidenceLimitValue', 'ResultCommentText', 'ResultSamplingPointName',
 'ResultDetectionQuantitationLimitUrl']
harmonize_wq.domains.get_domain_dict(table, cols=None)

Get domain values for specified table.

Parameters:
  • table (str) – csv table name (without extension).

  • cols (list, optional) – Columns to use as {key, value}. The default is None, which uses [‘Name’, ‘Description’].

Returns:

Dictionary where {cols[0]: cols[1]}

Return type:

dict

Examples

Return dictionary for a domain from a WQP table (e.g., ‘ResultSampleFraction’). Only the default keys (‘Name’) are shown because the values (‘Description’) are long:

>>> from harmonize_wq import domains
>>> domains.get_domain_dict('ResultSampleFraction').keys() 
dict_keys(['Acid Soluble', 'Bed Sediment', 'Bedload', 'Bioavailable', 'Comb Available',
           'Dissolved', 'Extractable', 'Extractable, CaCO3-bound', 'Extractable, exchangeable',
           'Extractable, organic-bnd', 'Extractable, other', 'Extractable, oxide-bound',
           'Extractable, residual', 'Field***', 'Filter/sieve residue', 'Filterable',
           'Filtered field and/or lab', 'Filtered, field', 'Filtered, lab',
           'Fixed', 'Free Available', 'Inorganic', 'Leachable', 'Net (Hot)',
           'Non-Filterable (Particle)', 'Non-settleable', 'Non-volatile',
           'None', 'Organic', 'Pot. Dissolved', 'Semivolatile', 'Settleable',
           'Sieved', 'Strong Acid Diss', 'Supernate', 'Suspended', 'Total',
           'Total Recoverable', 'Total Residual', 'Total Soluble',
           'Unfiltered', 'Unfiltered, field', 'Vapor', 'Volatile',
           'Weak Acid Diss', 'Yield', 'non-linear function'])
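
The {cols[0]: cols[1]} mapping described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the two-row table and the helper name domain_dict_from_csv are hypothetical, and real WQP domain tables have more rows and columns.

```python
import csv
import io

# Hypothetical two-row table standing in for a WQP domain csv.
csv_text = (
    "Name,Description\n"
    "Dissolved,Example description A\n"
    "Total,Example description B\n"
)


def domain_dict_from_csv(text, cols=None):
    """Map values of cols[0] to values of cols[1] for each row of a csv."""
    if cols is None:
        cols = ["Name", "Description"]
    reader = csv.DictReader(io.StringIO(text))
    return {row[cols[0]]: row[cols[1]] for row in reader}


domain = domain_dict_from_csv(csv_text)
print(list(domain.keys()))  # ['Dissolved', 'Total']
```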
harmonize_wq.domains.harmonize_TADA_dict()

Get structured dictionary from TADA HarmonizationTemplate csv.

Based on target column names and sample fractions.

Returns:

full_dict

{‘TADA.CharacteristicName’: {Target.TADA.CharacteristicName: {Target.TADA.ResultSampleFractionText: [Target.TADA.ResultSampleFractionText]}}}

Return type:

dict

harmonize_wq.domains.re_case(word, domain_list)

Change instance of word in domain_list to UPPERCASE.

Parameters:
  • word (str) – Word to alter in domain_list.

  • domain_list (list) – List including word.

Returns:

Word from domain_list in UPPERCASE.

Return type:

str

harmonize_wq.domains.registry_adds_list(out_col)

Get units to add to pint unit registry by out_col column.

Typically out_col refers back to column used for a value from the ‘CharacteristicName’ column.

Parameters:

out_col (str) – The result column a unit registry is being built for.

Returns:

List of strings with unit additions in expected format.

Return type:

list

Examples

Get the unit definitions to add to a pint unit registry for, e.g., ‘Sediment’:

>>> from harmonize_wq import domains
>>> domains.registry_adds_list('Sediment')  
['fraction = [] = frac',
 'percent = 1e-2 frac',
 'parts_per_thousand = 1e-3 = ppth',
 'parts_per_million = 1e-6 fraction = ppm']

harmonize_wq.harmonize module

Functions to harmonize data retrieved from EPA’s Water Quality Portal.

harmonize_wq.harmonize.dissolved_oxygen(wqp)

Standardize ‘Dissolved Oxygen (DO)’ characteristic.

Uses wq_data.WQCharData to check units, check unit dimensionality and perform appropriate unit conversions.

Parameters:

wqp (wq_data.WQCharData) – WQP Characteristic Info Object to check units, check unit dimensionality and perform appropriate unit conversions.

Returns:

wqp – WQP Characteristic Info Object with updated attributes.

Return type:

wq_data.WQCharData

harmonize_wq.harmonize.harmonize(df_in, char_val, units_out=None, errors='raise', intermediate_columns=False, report=False)

Harmonize char_val rows based on methods specific to that char_val.

All rows where the value in the ‘CharacteristicName’ column matches char_val will have their results harmonized based on available methods for that char_val.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with the expected columns (change based on char_val).

  • char_val (str) – Target value in ‘CharacteristicName’ column.

  • units_out (str, optional) – Desired units to convert results into. The default, None, uses the constant domains.OUT_UNITS.

  • errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will return NaN.

  • intermediate_columns (bool, optional) – Return intermediate columns. The default, False, does not return these.

  • report (bool, optional) – Print a change summary report. The default is False.

Returns:

df – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build example df_in table from harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:

>>> import pandas
>>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests'
>>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt')
>>> df1.shape
(359505, 35)
>>> from harmonize_wq import harmonize
>>> df_result = harmonize.harmonize(df1, 'Temperature, water')
>>> df_result
       OrganizationIdentifier  ...           Temperature
0                21FLHILL_WQX  ...  29.93 degree_Celsius
1                21FLHILL_WQX  ...  17.82 degree_Celsius
2                  21FLGW_WQX  ...  22.42 degree_Celsius
3                21FLMANA_WQX  ...   30.0 degree_Celsius
4                21FLHILL_WQX  ...  30.37 degree_Celsius
...                       ...  ...                   ...
359500           21FLHILL_WQX  ...  28.75 degree_Celsius
359501           21FLHILL_WQX  ...  23.01 degree_Celsius
359502            21FLTBW_WQX  ...  29.97 degree_Celsius
359503           21FLPDEM_WQX  ...  32.01 degree_Celsius
359504           21FLSMRC_WQX  ...                   NaN

[359505 rows x 37 columns]

List columns that were added:

>>> df_result.columns[-2:]
Index(['QA_flag', 'Temperature'], dtype='object')

See also

See any of the ‘Detailed’ notebooks found in demos (https://github.com/USEPA/harmonize-wq/tree/main/demos) for examples of how this function is used to standardize, clean, and wrangle a Water Quality Portal query response, one ‘CharacteristicName’ value at a time.
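
The three errors options can be pictured with a small plain-Python sketch. It is only one reading of the parameter description above, not the library's code: the factor table, the function name to_mg_per_l, and the choice to leave a skipped value unchanged are all assumptions made for illustration.

```python
from math import nan

# Hypothetical factors for converting a few mass/volume units to mg/l.
TO_MG_PER_L = {"mg/l": 1.0, "ug/l": 0.001, "g/l": 1000.0}


def to_mg_per_l(value, unit, errors="raise"):
    """Convert value to mg/l, handling unconvertible units per ``errors``."""
    if unit in TO_MG_PER_L:
        return value * TO_MG_PER_L[unit]
    if errors == "raise":
        raise ValueError(f"cannot convert {unit!r} to mg/l")
    if errors == "skip":
        return value  # assumed: value is left as-is, not converted
    if errors == "ignore":
        return nan  # the result becomes NaN
    raise ValueError("errors must be 'ignore', 'raise', or 'skip'")


print(to_mg_per_l(2.0, "g/l"))                      # 2000.0
print(to_mg_per_l(5.0, "deg C", errors="skip"))     # 5.0
print(to_mg_per_l(5.0, "deg C", errors="ignore"))   # nan
```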

harmonize_wq.harmonize.harmonize_all(df_in, errors='raise')

Harmonizes all ‘CharacteristicName’ column values with methods.

All results are standardized to default units. Intermediate columns are not retained. See domains.out_col_lookup() for list of values with methods.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with the expected columns (changes based on values in ‘CharacteristicName’ column).

  • errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will return NaN.

Returns:

df – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build example df_in table from harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:

>>> import pandas
>>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests'
>>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt')
>>> df1.shape
(359505, 35)

Running the function may print messages or warnings when unexpected values are encountered, such as unexpected nutrient sample fractions:

>>> from harmonize_wq import harmonize
>>> df_result_all = harmonize.harmonize_all(df1)
1 Phosphorus sample fractions not in frac_dict
1 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"
>>> df_result_all
       OrganizationIdentifier  ...           Temperature
0                21FLHILL_WQX  ...  29.93 degree_Celsius
1                21FLHILL_WQX  ...  17.82 degree_Celsius
2                  21FLGW_WQX  ...  22.42 degree_Celsius
3                21FLMANA_WQX  ...   30.0 degree_Celsius
4                21FLHILL_WQX  ...  30.37 degree_Celsius
...                       ...  ...                   ...
359500           21FLHILL_WQX  ...  28.75 degree_Celsius
359501           21FLHILL_WQX  ...  23.01 degree_Celsius
359502            21FLTBW_WQX  ...  29.97 degree_Celsius
359503           21FLPDEM_WQX  ...  32.01 degree_Celsius
359504           21FLSMRC_WQX  ...                   NaN

[359505 rows x 42 columns]

List columns that were added:

>>> sorted(list(df_result_all.columns[-7:]))
['Other_Phosphorus', 'Phosphorus', 'QA_flag', 'Speciation',
 'TDP_Phosphorus', 'TP_Phosphorus', 'Temperature']

See also

See any of the ‘Simple’ notebooks found in demos (https://github.com/USEPA/harmonize-wq/tree/main/demos) for examples of how this function is used to standardize, clean, and wrangle a Water Quality Portal query response.

harmonize_wq.harmonize.salinity(wqp)

Standardize ‘Salinity’ characteristic.

Uses wq_data.WQCharData to check basis, check units, check unit dimensionality and perform appropriate unit conversions.

Notes

PSU = PSS = ppth. Because pint parses ‘ppt’ as picopint, it is changed to ‘ppth’.

Parameters:

wqp (wq_data.WQCharData) – WQP Characteristic Info Object.

Returns:

wqp – WQP Characteristic Info Object with updated attributes.

Return type:

wq_data.WQCharData
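
The unit clean-up mentioned in the note above might look something like the following sketch. The alias table and the function name standardize_salinity_unit are hypothetical; whether the library rewrites ‘PSU’ and ‘PSS’ or defines them as equivalent units in its registry is not stated here, so the mapping is an assumption.

```python
# Assumed alias table based on the note above (PSU = PSS = ppth).
SALINITY_ALIASES = {"ppt": "ppth", "PSU": "ppth", "PSS": "ppth"}


def standardize_salinity_unit(unit):
    """Return 'ppth' for known salinity aliases, else the unit unchanged."""
    unit = unit.strip()
    return SALINITY_ALIASES.get(unit, unit)


print(standardize_salinity_unit("ppt"))   # ppth
print(standardize_salinity_unit("g/kg"))  # g/kg
```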

harmonize_wq.harmonize.sediment(wqp)

Standardize ‘Sediment’ characteristic.

Uses wq_data.WQCharData to check basis, check units, and check unit dimensionality.

Parameters:

wqp (wq_data.WQCharData) – WQP Characteristic Info Object.

Returns:

wqp – WQP Characteristic Info Object with updated attributes.

Return type:

wq_data.WQCharData

harmonize_wq.harmonize.turbidity(wqp)

Standardize ‘Turbidity’ characteristic.

Uses wq_data.WQCharData to check units, check unit dimensionality and perform appropriate unit conversions.

Notes

See USGS Report Chapter A6, Section 6.7, Turbidity. See ASTM D7315-17 for equivalent unit definitions:

  • ‘NTU’ - 400-680nm (EPA 180.1), range 0.0-40.

  • ‘NTRU’ - 400-680nm (2130B), range 0-10,000.

  • ‘NTMU’ - 400-680nm.

  • ‘FNU’ - 780-900nm (ISO 7027), range 0-1000.

  • ‘FNRU’ - 780-900nm (ISO 7027), range 0-10,000.

  • ‘FAU’ - 780-900nm, range 20-1000.

Older methods:

  • ‘FTU’ - lacks instrumentation specificity.

  • ‘SiO2’ (ppm or mg/l) - concentration of calibration standard (=JTU).

  • ‘JTU’ - candle instead of formazin standard; near 40 NTU these may be equivalent, but highly variable.

Conversions used: cm <-> NTU, see convert.cm_to_NTU() from USU.

Alternative conversions available but not currently used by default:

  • convert.FNU_to_NTU() from Gohin (2011) Ocean Sci., 7, 705–732 https://doi.org/10.5194/os-7-705-2011.

  • convert.SiO2_to_NTU() linear relation from Otilia et al. 2013.

  • convert.JTU_to_NTU() linear relation from Otilia et al. 2013.

Otilia, Rusănescu Carmen, Rusănescu Marin, and Stoica Dorel. MONITORING OF PHYSICAL INDICATORS IN WATER SAMPLES. https://hidraulica.fluidas.ro/2013/nr_2/84_89.pdf.

Parameters:

wqp (wq_data.WQCharData) – WQP Characteristic Info Object.

Returns:

wqp – WQP Characteristic Info Object with updated attributes.

Return type:

wq_data.WQCharData

harmonize_wq.location module

Functions to clean/correct location data.

harmonize_wq.location.get_harmonized_stations(query, aoi=None)

Query, harmonize and clip stations.

Queries the Water Quality Portal for stations with data matching the query, harmonizes those stations’ location information, and clips it to the area of interest (aoi) if specified.

See www.waterqualitydata.us/webservices_documentation for API reference.

Parameters:
  • query (dict) – Water Quality Portal query as dictionary.

  • aoi (geopandas.GeoDataFrame, optional) – Area of interest to clip stations to. The default None returns all stations in the query extent.

Returns:

  • stations_gdf (geopandas.GeoDataFrame) – Harmonized stations.

  • stations (pandas.DataFrame) – Raw station results from WQP.

  • site_md (dataretrieval.utils.Metadata) – Custom dataretrieval metadata object pertaining to the WQP query.

Examples

See any of the ‘Simple’ notebooks found in demos (https://github.com/USEPA/harmonize-wq/tree/main/demos) for examples of how this function is used to query and harmonize stations.

harmonize_wq.location.harmonize_locations(df_in, out_EPSG=4326, intermediate_columns=False, **kwargs)

Create harmonized geopandas GeoDataFrame from pandas DataFrame.

Takes a DataFrame with lat/lon in multiple Coordinate Reference Systems (CRS), transforms them to the out_EPSG CRS, and converts the result to a geopandas.GeoDataFrame. A ‘QA_flag’ column is added to the result and populated for any row that has location-based problems, such as limited decimal precision or an unknown input CRS.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with the required columns (see kwargs for expected defaults) to be converted to GeoDataFrame.

  • out_EPSG (int, optional) – EPSG factory code for desired output Coordinate Reference System datum. The default is 4326, for the WGS84 Datum used by WQP queries.

  • intermediate_columns (bool, optional) – Return intermediate columns. The default, False, does not return these.

  • **kwargs (optional) – Accepts crs_col, lat_col, and lon_col parameters if non-default:

  • crs_col (str, optional) – Name of column in DataFrame with the Coordinate Reference System datum. The default is ‘HorizontalCoordinateReferenceSystemDatumName’.

  • lat_col (str, optional) – Name of column in DataFrame with the latitude coordinate. The default is ‘LatitudeMeasure’.

  • lon_col (str, optional) – Name of column in DataFrame with the longitude coordinate. The default is ‘LongitudeMeasure’.

Returns:

gdf – GeoDataFrame of df_in with coordinates in out_EPSG datum.

Return type:

geopandas.GeoDataFrame

Examples

Build pandas DataFrame to use in example:

>>> df_in = pandas.DataFrame(
...     {
...         "LatitudeMeasure": [27.5950355, 27.52183, 28.0661111],
...         "LongitudeMeasure": [-82.0300865, -82.64476, -82.3775],
...         "HorizontalCoordinateReferenceSystemDatumName":
...             ["NAD83", "WGS84", "NAD27"],
...     }
... )
>>> df_in
   LatitudeMeasure  ...  HorizontalCoordinateReferenceSystemDatumName
0        27.595036  ...                                         NAD83
1        27.521830  ...                                         WGS84
2        28.066111  ...                                         NAD27

[3 rows x 3 columns]
>>> from harmonize_wq import location
>>> location.harmonize_locations(df_in)
   LatitudeMeasure  LongitudeMeasure  ... QA_flag                    geometry
0        27.595036        -82.030086  ...     NaN  POINT (-82.03009 27.59504)
1        27.521830        -82.644760  ...     NaN  POINT (-82.64476 27.52183)
2        28.066111        -82.377500  ...     NaN  POINT (-82.37750 28.06611)

[3 rows x 5 columns]
harmonize_wq.location.infer_CRS(df_in, out_EPSG, out_col='EPSG', bad_crs_val=None, crs_col='HorizontalCoordinateReferenceSystemDatumName')

Replace missing or unrecognized Coordinate Reference System (CRS).

Replaces with desired CRS and notes it was missing in ‘QA_flag’ column.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • out_EPSG (int) – EPSG code of the desired CRS to use.

  • out_col (str, optional) – Column in df to write out_EPSG to. The default is ‘EPSG’.

  • bad_crs_val (str, optional) – Bad Coordinate Reference System (CRS) datum name value to replace. The default is None for missing datum.

  • crs_col (str, optional) – Datum column in df_in. The default is ‘HorizontalCoordinateReferenceSystemDatumName’.

Returns:

df_out – Updated copy of df_in.

Return type:

pandas.DataFrame

Examples

Build pandas DataFrame to use in example, where crs_col name is ‘Datum’ rather than default ‘HorizontalCoordinateReferenceSystemDatumName’:

>>> from numpy import nan
>>> df_in = pandas.DataFrame({'Datum': ['NAD83', 'WGS84', '', None, nan]})
>>> df_in  
   Datum
0  NAD83
1  WGS84
2
3   None
4    NaN
>>> from harmonize_wq import location
>>> location.infer_CRS(df_in, out_EPSG=4326, crs_col='Datum')
   Datum                                  QA_flag    EPSG
0  NAD83                                      NaN     NaN
1  WGS84                                      NaN     NaN
2                                             NaN     NaN
3   None  Datum: MISSING datum, EPSG:4326 assumed  4326.0
4    NaN  Datum: MISSING datum, EPSG:4326 assumed  4326.0

NOTE: missing (NaN) and bad CRS values (bad_crs_val=None) are given an EPSG and noted in the ‘QA_flag’ column.
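
The behavior in the example output can be approximated without pandas. The sketch below is a simplified stand-in, not the library's implementation: the function names are hypothetical, only the default case (bad_crs_val=None, i.e. a missing datum) is handled, and the flag text is copied from the example output above.

```python
from math import isnan, nan


def is_missing(value):
    """True for None or float NaN (an empty string is not treated as missing)."""
    return value is None or (isinstance(value, float) and isnan(value))


def infer_crs_sketch(datums, out_epsg, crs_col="Datum"):
    """Return one row per datum, assuming out_epsg where the datum is missing."""
    rows = []
    for datum in datums:
        if is_missing(datum):
            flag = f"{crs_col}: MISSING datum, EPSG:{out_epsg} assumed"
            rows.append({crs_col: datum, "QA_flag": flag, "EPSG": out_epsg})
        else:
            rows.append({crs_col: datum, "QA_flag": nan, "EPSG": nan})
    return rows


rows = infer_crs_sketch(["NAD83", "WGS84", "", None, nan], out_epsg=4326)
print(rows[3]["QA_flag"])  # Datum: MISSING datum, EPSG:4326 assumed
```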

harmonize_wq.location.transform_vector_of_points(df_in, datum, out_EPSG)

Transform points by vector (sub-sets points by EPSG==datum).

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • datum (int) – Current datum (EPSG code) to transform.

  • out_EPSG (int) – EPSG factory code for desired output Coordinate Reference System datum.

Returns:

df – Updated copy of df_in.

Return type:

pandas.DataFrame

harmonize_wq.visualize module

Functions to help visualize data.

harmonize_wq.visualize.map_counts(df_in, gdf, col=None)

Get GeoDataFrame summarized by count of results for each station.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with subset of results.

  • gdf (geopandas.GeoDataFrame) – GeoDataFrame with monitoring locations.

  • col (str, optional) – Column in df_in to aggregate results to in addition to location. The default is None, where results are only aggregated on location.

Returns:

GeoDataFrame with count of results for each station.

Return type:

geopandas.GeoDataFrame

Examples

Build example DataFrame of results:

>>> from pandas import DataFrame
>>> df_in = DataFrame({'ResultMeasureValue': [5.1, 1.2, 8.7],
...                    'MonitoringLocationIdentifier': ['ID1', 'ID2', 'ID1']
...                           })
>>> df_in
   ResultMeasureValue MonitoringLocationIdentifier
0                 5.1                          ID1
1                 1.2                          ID2
2                 8.7                          ID1

Build example GeoDataFrame of monitoring locations:

>>> import geopandas
>>> from shapely.geometry import Point
>>> from numpy import nan
>>> d = {'MonitoringLocationIdentifier': ['ID1', 'ID2'],
...      'QA_flag': [nan, nan],
...      'geometry': [Point(1, 2), Point(2, 1)]}
>>> gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326")
>>> gdf
  MonitoringLocationIdentifier  QA_flag                 geometry
0                          ID1      NaN  POINT (1.00000 2.00000)
1                          ID2      NaN  POINT (2.00000 1.00000)

Combine these to get an aggregation of results per station:

>>> import harmonize_wq
>>> cnt_gdf = harmonize_wq.visualize.map_counts(df_in, gdf)
>>> cnt_gdf
  MonitoringLocationIdentifier  cnt                 geometry  QA_flag
0                          ID1    2  POINT (1.00000 2.00000)      NaN
1                          ID2    1  POINT (2.00000 1.00000)      NaN

These aggregate results can then be plotted:

>>> cnt_gdf.plot(column='cnt', cmap='Blues', legend=True)
<Axes: >
harmonize_wq.visualize.map_measure(df_in, gdf, col)

Get GeoDataFrame summarized by average of results for each station.

geopandas.GeoDataFrame will have new column ‘mean’ with the average of col values for that location.

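Parameters:
  • df_in (pandas.DataFrame) – DataFrame with subset of results.

  • gdf (geopandas.GeoDataFrame) – GeoDataFrame with monitoring locations.

  • col (str) – Column name in df_in to calculate the average of.

Returns: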

GeoDataFrame with average value of results for each station.

Return type:

geopandas.GeoDataFrame

Examples

Build array of pint Quantity for Temperature:

>>> from pint import Quantity
>>> u = 'degree_Celsius'
>>> temperatures = [Quantity(5.1, u), Quantity(1.2, u), Quantity(8.7, u)]

Build example pandas DataFrame of results:

>>> from pandas import DataFrame
>>> df_in = DataFrame({'Temperature': temperatures,
...                    'MonitoringLocationIdentifier': ['ID1', 'ID2', 'ID1']
...                    })
>>> df_in
          Temperature MonitoringLocationIdentifier
0  5.1 degree_Celsius                          ID1
1  1.2 degree_Celsius                          ID2
2  8.7 degree_Celsius                          ID1

Build example geopandas GeoDataFrame of monitoring locations:

>>> import geopandas
>>> from shapely.geometry import Point
>>> from numpy import nan
>>> d = {'MonitoringLocationIdentifier': ['ID1', 'ID2'],
...      'QA_flag': [nan, nan],
...      'geometry': [Point(1, 2), Point(2, 1)]}
>>> gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326")
>>> gdf
  MonitoringLocationIdentifier  QA_flag                 geometry
0                          ID1      NaN  POINT (1.00000 2.00000)
1                          ID2      NaN  POINT (2.00000 1.00000)

Combine these to get an aggregation of results per station:

>>> from harmonize_wq import visualize
>>> avg_temp = visualize.map_measure(df_in, gdf, 'Temperature')
>>> avg_temp
  MonitoringLocationIdentifier  cnt  mean                 geometry  QA_flag
0                          ID1    2   6.9  POINT (1.00000 2.00000)      NaN
1                          ID2    1   1.2  POINT (2.00000 1.00000)      NaN

These aggregate results can then be plotted:

>>> avg_temp.plot(column='mean', cmap='Blues', legend=True)
<Axes: >
harmonize_wq.visualize.print_report(results_in, out_col, unit_col_in, threshold=None)

Print a standardized report of changes made.

Parameters:
  • results_in (pandas.DataFrame) – DataFrame with subset of results.

  • out_col (str) – Name of column in results_in with final result.

  • unit_col_in (str) – Name of column with original units.

  • threshold (dict, optional) – Dictionary with min and max keys. The default is None.

Return type:

None.

See also

See any of the ‘Detailed’ notebooks found in demos for examples of how this function is leveraged by the harmonize.harmonize() report argument.

harmonize_wq.visualize.station_summary(df_in, col)

Get summary table for stations.

Summary table as DataFrame with rows for each station, count, and column average.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame with results to summarize.

  • col (str) – Column name in df_in to summarize results for.

Returns:

Table with result count and average summarized by station.

Return type:

pandas.DataFrame
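
A per-station count and average of the kind described above can be sketched in plain Python. This is an illustrative stand-in rather than the library's pandas-based implementation; the function name station_summary_sketch is hypothetical, and the values reuse the map_measure example above.

```python
from collections import defaultdict

# (station ID, temperature in degrees Celsius) pairs from the example above.
results = [("ID1", 5.1), ("ID2", 1.2), ("ID1", 8.7)]


def station_summary_sketch(rows):
    """Return {station: {'cnt': n, 'mean': average}} for (station, value) rows."""
    grouped = defaultdict(list)
    for station, value in rows:
        grouped[station].append(value)
    return {
        station: {"cnt": len(values), "mean": sum(values) / len(values)}
        for station, values in grouped.items()
    }


summary = station_summary_sketch(results)
for station, stats in summary.items():
    print(station, stats["cnt"], round(stats["mean"], 2))
# ID1 2 6.9
# ID2 1 1.2
```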

harmonize_wq.wq_data module

Class for harmonizing data retrieved from EPA’s Water Quality Portal.

class harmonize_wq.wq_data.WQCharData(df_in, char_val)

Bases: object

Class for specific characteristic in Water Quality Portal results.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • char_val (str) – Expected value in ‘CharacteristicName’ column.

df

DataFrame with results for the specific characteristic.

Type:

pandas.DataFrame

c_mask

Row conditional (bool) mask to limit df rows to only those for the specific characteristic.

Type:

pandas.Series

col

Standard WQCharData.df column names for unit_in, unit_out, and measure.

Type:

types.SimpleNamespace

out_col

Column name in df for results, set using char_val.

Type:

str

ureg

pint unit registry, initially standard unit registry.

Type:

pint.UnitRegistry

units

Units all results in out_col column will be converted into. Default units are taken from domains.OUT_UNITS[out_col].

Type:

str

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Temperature, water',],
...                 'ResultMeasure/MeasureUnitCode': [nan, nan],
...                 'ResultMeasureValue': ['1.0', '10.0',],
...                 })
>>> df
   CharacteristicName  ResultMeasure/MeasureUnitCode ResultMeasureValue
0          Phosphorus                            NaN                1.0
1  Temperature, water                            NaN               10.0
>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.df
   CharacteristicName  ResultMeasure/MeasureUnitCode  ... Units  Phosphorus
0          Phosphorus                            NaN  ...   NaN         1.0
1  Temperature, water                            NaN  ...   NaN         NaN

[2 rows x 5 columns]
>>> wq.df.columns
Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode',
       'ResultMeasureValue', 'Units', 'Phosphorus'],
      dtype='object')
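
The roles of c_mask and out_col in the example above can be pictured with a short plain-Python sketch. It is a simplification under stated assumptions: rows are dictionaries rather than a DataFrame, and out_col is simply set to the same string as char_val.

```python
from math import nan

rows = [
    {"CharacteristicName": "Phosphorus", "ResultMeasureValue": "1.0"},
    {"CharacteristicName": "Temperature, water", "ResultMeasureValue": "10.0"},
]
char_val = "Phosphorus"
out_col = "Phosphorus"  # assumed here to match char_val for simplicity

# Row mask: True only where the row belongs to the characteristic.
c_mask = [row["CharacteristicName"] == char_val for row in rows]

# Copy the measure into out_col for masked rows; other rows get NaN.
for row, keep in zip(rows, c_mask):
    row[out_col] = float(row["ResultMeasureValue"]) if keep else nan

print(c_mask)                           # [True, False]
print([row[out_col] for row in rows])   # [1.0, nan]
```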
apply_conversion(convert_fun, unit, u_mask=None)

Apply special dimension changing conversions.

This uses functions in the convert module and applies them across all cases of the current unit.

Parameters:
  • convert_fun (function) – Conversion function to apply.

  • unit (str) – Current unit.

  • u_mask (pandas.Series, optional) – Mask to use to identify what is being converted. The default is None, creating a unit mask based on unit.

Return type:

None.

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame(
...   {
...     'CharacteristicName': [
...       'Dissolved oxygen (DO)',
...       'Dissolved oxygen (DO)',
...     ],
...     'ResultMeasure/MeasureUnitCode': ['mg/l', '%'],
...     'ResultMeasureValue': ['1.0', '10.0',],
...   }
... )
>>> df
      CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0  Dissolved oxygen (DO)                          mg/l                1.0
1  Dissolved oxygen (DO)                             %               10.0

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import convert, wq_data
>>> wq = wq_data.WQCharData(df, 'Dissolved oxygen (DO)')
>>> wq.apply_conversion(convert.DO_saturation, '%')
>>> wq.df[['Units', 'DO']]
               Units        DO
0               mg/l  1.000000
1  milligram / liter  0.008262
check_basis(basis_col='MethodSpecificationName')

Determine speciation (basis) for measure.

Parameters:

basis_col (str, optional) – Basis column name. Default is ‘MethodSpecificationName’ which is replaced by ‘Speciation’. Other columns are updated in place.

Return type:

None.

Examples

Build DataFrame to use as input:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame(
...     {
...       "CharacteristicName": [
...         "Phosphorus",
...         "Temperature, water",
...         "Phosphorus",
...       ],
...       "ResultMeasure/MeasureUnitCode": ["mg/l as P", nan, "mg/l",],
...       "ResultMeasureValue": ["1.0", "67.0", "10",],
...       "MethodSpecificationName": [nan, nan, "as PO4",],
...     }
... )
>>> df[['ResultMeasure/MeasureUnitCode', 'MethodSpecificationName']]
  ResultMeasure/MeasureUnitCode MethodSpecificationName
0                     mg/l as P                     NaN
1                           NaN                     NaN
2                          mg/l                  as PO4

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.df.columns  
Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode',
       'ResultMeasureValue', 'MethodSpecificationName', 'Units', 'Phosphorus'],
      dtype='object')

Run check_basis method to determine speciation for phosphorus:

>>> wq.check_basis()
>>> wq.df[['MethodSpecificationName', 'Speciation']]
  MethodSpecificationName Speciation
0                     NaN          P
1                     NaN        NaN
2                  as PO4        PO4

Note where basis was part of ‘ResultMeasure/MeasureUnitCode’ it has been removed in ‘Units’:

>>> wq.df.iloc[0]
CharacteristicName               Phosphorus
ResultMeasure/MeasureUnitCode     mg/l as P
ResultMeasureValue                      1.0
MethodSpecificationName                 NaN
Units                                  mg/l
Phosphorus                              1.0
Speciation                                P
Name: 0, dtype: object
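
How a basis embedded in a unit string could be separated from the unit can be sketched with the dictionary structure documented for basis.unit_basis_dict. This is a minimal illustration, not the method's actual code: the function name split_basis is hypothetical, and the dictionary is copied from the basis.unit_basis_dict['Phosphorus']['as P'] example output.

```python
# Copied from the basis.unit_basis_dict['Phosphorus']['as P'] example output.
AS_P = {"mg/l": ["mg/l as P", "mg/l P"], "mg/kg": ["mg/kg as P", "mg/kg P"]}


def split_basis(unit, basis_name="as P", basis_units=AS_P):
    """Return (standard_unit, basis) if unit embeds the basis, else (unit, None)."""
    for standard_unit, variants in basis_units.items():
        if unit in variants:
            return standard_unit, basis_name
    return unit, None


print(split_basis("mg/l as P"))  # ('mg/l', 'as P')
print(split_basis("mg/l"))       # ('mg/l', None)
```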
check_units(flag_col=None)

Check units.

Checks for bad units that are missing (assumes default_unit) or unrecognized as valid by unit registry (ureg). Does not check for units in the correct dimensions, or a mistaken identity (e.g. ‘deg F’ recognized as ‘degree * farad’).

Parameters:

flag_col (str, optional) – Column to reference in string for ‘QA_flags’. The default None uses WQCharData.col.unit_out attribute.

Return type:

None.

Examples

Build DataFrame to use as input:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame(
...   {
...     "CharacteristicName": [
...       "Phosphorus",
...       "Temperature, water",
...       "Phosphorus",
...     ],
...     "ResultMeasure/MeasureUnitCode": [
...       nan,
...       nan,
...       "Unknown",
...     ],
...     "ResultMeasureValue": [
...       "1.0",
...       "67.0",
...       "10",
...     ],
...   }
... )
>>> df
   CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0          Phosphorus                           NaN                1.0
1  Temperature, water                           NaN               67.0
2          Phosphorus                       Unknown                 10

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.df.Units
0        NaN
1        NaN
2    Unknown
Name: Units, dtype: object

Run check_units method to replace bad or missing units for phosphorus:

>>> wq.check_units()  
UserWarning: WARNING: 'Unknown' UNDEFINED UNIT for Phosphorus
>>> wq.df[['CharacteristicName', 'Units', 'QA_flag']]
   CharacteristicName Units                                            QA_flag
0          Phosphorus  mg/l  ResultMeasure/MeasureUnitCode: MISSING UNITS, ...
1  Temperature, water   NaN                                                NaN
2          Phosphorus  mg/l  ResultMeasure/MeasureUnitCode: 'Unknown' UNDEF...

Note: it didn’t infer units for ‘Temperature, water’ because wq is Phosphorus specific.
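
The replace-and-flag logic shown above can be approximated as follows. This is a sketch under assumptions: the set of known units stands in for a pint unit registry, the function name check_unit_sketch is hypothetical, and the flag strings are shortened placeholders rather than the library's exact messages.

```python
from math import isnan

# Hypothetical stand-in for the units a registry would recognize.
KNOWN_UNITS = {"mg/l", "ug/l", "mg/kg"}


def check_unit_sketch(unit, default_unit="mg/l"):
    """Return (unit_to_use, flag); flag is None when the unit is recognized."""
    if unit is None or (isinstance(unit, float) and isnan(unit)):
        return default_unit, "MISSING UNITS"  # missing: assume the default
    if unit not in KNOWN_UNITS:
        return default_unit, f"'{unit}' UNDEFINED UNIT"  # unrecognized
    return unit, None


print(check_unit_sketch(float("nan")))  # ('mg/l', 'MISSING UNITS')
print(check_unit_sketch("Unknown"))     # ('mg/l', "'Unknown' UNDEFINED UNIT")
print(check_unit_sketch("ug/l"))        # ('ug/l', None)
```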

convert_units(default_unit=None, errors='raise')

Update out_col to convert units.

Updates the out_col column of the class pandas.DataFrame, converting values from old units to default_unit.

Parameters:
  • default_unit (str, optional) – Units to convert values to. Default None uses units attribute.

  • errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will be NaN.

Return type:

None.

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Temperature, water',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/ml', 'deg C'],
...                 'ResultMeasureValue': ['1.0', '10.0',],
...                 })
>>> df
   CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0          Phosphorus                         mg/ml                1.0
1  Temperature, water                         deg C               10.0

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.convert_units()
>>> wq.df[['ResultMeasureValue', 'Units', 'Phosphorus']]
  ResultMeasureValue  Units                            Phosphorus
0                1.0  mg/ml  1000.0000000000001 milligram / liter
1               10.0    NaN                                   NaN
dimension_fixes()

Input/output for dimension handling.

Result dictionary key is old_unit and value is equation to get it into the desired dimension. Result list has substance to include as part of unit.

Notes

These are next processed iteratively, one dimension at a time, except for mole conversions, which are further split by basis (one at a time).

Returns:

  • dimension_dict (dict) – Dictionary with old_unit:new_unit.

  • mol_list (list) – List of Mole (substance) units.

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/l', 'mg/kg',],
...                 'ResultMeasureValue': ['1.0', '10',],
...                 })
>>> df
  CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0         Phosphorus                          mg/l                1.0
1         Phosphorus                         mg/kg                 10

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.dimension_fixes()
({'mg/kg': 'mg/kg * H2O'}, [])
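
The arithmetic behind an equation such as 'mg/kg * H2O' can be sketched as follows. This assumes that 'H2O' stands for the density of water, taken here as 1 kg/l; the constant and the function name are illustrative and not part of harmonize_wq.

```python
# Assumption: 'H2O' denotes the density of water, taken here as 1 kg/l.
WATER_DENSITY_KG_PER_L = 1.0


def mass_fraction_to_concentration(mg_per_kg, density_kg_per_l=WATER_DENSITY_KG_PER_L):
    """Convert mg/kg to mg/l: (mg / kg) * (kg / l) = mg / l."""
    return mg_per_kg * density_kg_per_l


concentration = mass_fraction_to_concentration(10.0)  # 10 mg/kg expressed in mg/l
```

Multiplying by a density changes the dimension from mass per mass to mass per volume, which is what allows the later conversion to 'mg/l'.
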
dimensions_list(m_mask=None)

Get list of unique unit dimensions.

Parameters:

m_mask (pandas.Series, optional) – Conditional mask to limit rows. The default, None, uses measure_mask().

Returns:

List of units with mismatched dimensions.

Return type:

list

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/l', 'mg/kg',],
...                 'ResultMeasureValue': ['1.0', '10',],
...                 })
>>> df
  CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0         Phosphorus                          mg/l                1.0
1         Phosphorus                         mg/kg                 10

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.dimensions_list()
['mg/kg']
fraction(frac_dict=None, catch_all=None, suffix=None, fract_col='ResultSampleFractionText')

Create columns for sample fractions using frac_dict to set names.

Parameters:
  • frac_dict (dict, optional) – Dictionary where {new_col : [fraction names]}, as in the example below. The default None starts with an empty dictionary.

  • catch_all (str, optional) – Name for the new field that receives sample fractions not mapped by frac_dict.

  • suffix (str, optional) – String to add to the end of any new column name. The default, None, uses the out_col attribute.

  • fract_col (str, optional) – Column name where sample fraction is defined. The default is ‘ResultSampleFractionText’.

Returns:

frac_dict – frac_dict updated to include any fract_col not already defined.

Return type:

dict

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/l', 'mg/kg',],
...                 'ResultMeasureValue': ['1.0', '10',],
...                 'ResultSampleFractionText': ['Dissolved', ''],
...                 })
>>> df
  CharacteristicName  ... ResultSampleFractionText
0         Phosphorus  ...                Dissolved
1         Phosphorus  ...

[2 rows x 4 columns]

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')

Go through the required checks and conversions:

>>> wq.check_units()
>>> dimension_dict, mol_list = wq.dimension_fixes()
>>> wq.replace_unit_by_dict(dimension_dict, wq.measure_mask())
>>> wq.moles_convert(mol_list)
>>> wq.convert_units()
>>> wq.df.columns
Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode',
       'ResultMeasureValue', 'ResultSampleFractionText', 'Units', 'Phosphorus',
       'QA_flag'],
      dtype='object')
>>> wq.df['Phosphorus']
0                   1.0 milligram / liter
1    10.000000000000002 milligram / liter
Name: Phosphorus, dtype: object

These results may have different, non-comparable sample fractions. First, split results using a provided frac_dict (as used in harmonize()):

>>> from numpy import nan
>>> frac_dict = {'TP_Phosphorus': ['Total'],
...              'TDP_Phosphorus': ['Dissolved'],
...              'Other_Phosphorus': ['', nan],}
>>> wq.fraction(frac_dict)
>>> wq.df.columns
Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode',
       'ResultMeasureValue', 'ResultSampleFractionText', 'Units', 'Phosphorus',
       'QA_flag', 'TDP_Phosphorus', 'Other_Phosphorus'],
      dtype='object')
>>> wq.df[['TDP_Phosphorus', 'Other_Phosphorus']]
          TDP_Phosphorus                      Other_Phosphorus
0  1.0 milligram / liter                                   NaN
1                    NaN  10.000000000000002 milligram / liter

Alternatively, the sample fraction lists from TADA can be used. In this case, they are added to the existing columns:

>>> wq.fraction('TADA')
>>> wq.df.columns
Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode',
       'ResultMeasureValue', 'ResultSampleFractionText', 'Units', 'Phosphorus',
       'QA_flag', 'TDP_Phosphorus', 'Other_Phosphorus',
       'TOTAL PHOSPHORUS_ MIXED FORMS'],
      dtype='object')
>>> wq.df[['TOTAL PHOSPHORUS_ MIXED FORMS', 'Other_Phosphorus']]
  TOTAL PHOSPHORUS_ MIXED FORMS                      Other_Phosphorus
0         1.0 milligram / liter                                   NaN
1                           NaN  10.000000000000002 milligram / liter
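
The splitting step can be sketched in pure Python. This is only an illustration of the idea, not the library code: rows are plain dictionaries, the function name split_by_fraction and the keys 'fraction' and 'value' are hypothetical, and the catch_all behavior shown is an assumption.

```python
import math


def split_by_fraction(rows, frac_dict, catch_all=None):
    """Copy each row's value into the column whose fraction list matches."""
    def matches(fraction, candidates):
        for candidate in candidates:
            both_nan = (isinstance(fraction, float) and math.isnan(fraction)
                        and isinstance(candidate, float) and math.isnan(candidate))
            if both_nan or fraction == candidate:
                return True
        return False

    out = []
    for row in rows:
        new_row = dict(row)
        target = catch_all  # used when no list in frac_dict matches
        for new_col, fractions in frac_dict.items():
            if matches(row["fraction"], fractions):
                target = new_col
                break
        if target is not None:
            new_row[target] = row["value"]
        out.append(new_row)
    return out


frac_dict = {
    "TP_Phosphorus": ["Total"],
    "TDP_Phosphorus": ["Dissolved"],
    "Other_Phosphorus": ["", math.nan],
}
rows = [
    {"fraction": "Dissolved", "value": 1.0},
    {"fraction": "", "value": 10.0},
]
split = split_by_fraction(rows, frac_dict)
```

Each value lands in exactly one fraction-specific column, so results with different sample fractions are no longer mixed in a single column.
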
measure_mask(column=None)

Get mask for characteristic and valid measure.

The mask is characteristic-specific (c_mask) and only includes rows where the column has a valid (non-NA) measure.

Parameters:

column (str, optional) – DataFrame column name to use. Default None uses WQCharData.out_col attribute.

Return type:

pandas.Series

Examples

Build DataFrame to use as input:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame(
...     {
...       'CharacteristicName': [
...         'Phosphorus',
...         'Temperature, water',
...         'Phosphorus',
...         'Phosphorus',
...       ],
...       'ResultMeasure/MeasureUnitCode': ['mg/l as P', nan, 'mg/l', 'mg/l',],
...       'ResultMeasureValue': ['1.0', '67.0', '10', 'None'],
...                 })
>>> df
   CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0          Phosphorus                     mg/l as P                1.0
1  Temperature, water                           NaN               67.0
2          Phosphorus                          mg/l                 10
3          Phosphorus                          mg/l               None

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')

Check measure mask:

>>> wq.measure_mask()
0     True
1    False
2     True
3    False
dtype: bool
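
The logic of the mask can be sketched without pandas. The helper names below are hypothetical, and the sketch assumes that the text 'None' in the example has already been parsed as a missing value.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sketch_measure_mask(names, measures, char_val):
    """True where the row is for char_val and its measure is not missing."""
    return [
        name == char_val and not is_missing(measure)
        for name, measure in zip(names, measures)
    ]


names = ["Phosphorus", "Temperature, water", "Phosphorus", "Phosphorus"]
measures = [1.0, 67.0, 10.0, math.nan]  # last value assumed parsed as missing
mask = sketch_measure_mask(names, measures, "Phosphorus")
```

Row 1 is excluded because it is a different characteristic, and row 3 because its measure is missing.
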
moles_convert(mol_list)

Update out_col with moles converted and reduce unit_col to units.

Parameters:

mol_list (list) – List of Mole (substance) units.

Return type:

None.

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'CharacteristicName': ['Organic carbon', 'Organic carbon',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/l', 'umol',],
...                 'ResultMeasureValue': ['1.0', '0.265',],
...                 'MethodSpecificationName': [nan, nan,],
...                 })
>>> df[['ResultMeasure/MeasureUnitCode', 'ResultMeasureValue']]
  ResultMeasure/MeasureUnitCode ResultMeasureValue
0                          mg/l                1.0
1                          umol              0.265

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Organic carbon')
>>> wq.df
  CharacteristicName ResultMeasure/MeasureUnitCode  ... Units  Carbon
0     Organic carbon                          mg/l  ...  mg/l   1.000
1     Organic carbon                          umol  ...  umol   0.265

[2 rows x 6 columns]

Run required checks:

>>> wq.check_basis()
>>> wq.check_units()

Assemble dimensions dict and moles list:

>>> dimension_dict, mol_list = wq.dimension_fixes()
>>> dimension_dict
{'umol': '0.00018015999999999998 gram / l'}
>>> mol_list
['0.00018015999999999998 gram / l']

Replace units by dimension_dict:

>>> wq.replace_unit_by_dict(dimension_dict, wq.measure_mask())
>>> wq.df[['Units', 'Carbon']]
                             Units  Carbon
0                             mg/l   1.000
1  0.00018015999999999998 gram / l   0.265

Convert Carbon measure into whole units:

>>> wq.moles_convert(mol_list)
>>> wq.df[['Units', 'Carbon']]
          Units    Carbon
0          mg/l  1.000000
1  gram / liter  0.000048

This allows final conversion without dimensionality issues:

>>> wq.convert_units()
>>> wq.df['Carbon']
0          1.0 milligram / liter
1    0.0477424 milligram / liter
Name: Carbon, dtype: object
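
The numbers in this example can be checked by hand. The sketch below only repeats the arithmetic using the factor shown in dimension_dict; the variable names are illustrative.

```python
import math

# Factor from dimension_dict above: grams per micromole.
# 0.00018016 g/umol is the same as 180.16 g/mol (1 mol = 1e6 umol).
GRAMS_PER_UMOL = 0.00018016

value_umol = 0.265
grams_per_liter = value_umol * GRAMS_PER_UMOL    # the moles_convert() step
milligrams_per_liter = grams_per_liter * 1000.0  # the convert_units() step
```

This gives about 0.0000477 gram / liter (displayed as 0.000048 above) and about 0.0477424 milligram / liter, matching the final output.
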
replace_unit_by_dict(val_dict, mask=None)

Do multiple replace_in_col() replacements using val_dict.

Replaces instances of val_dict key with val_dict value.

Parameters:
  • val_dict (dict) – Occurrences of key in the unit column are replaced with the value.

  • mask (pandas.Series, optional) – Conditional mask to limit rows. The default, None, uses the c_mask attribute.

Return type:

None.

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Fecal Coliform', 'Fecal Coliform',],
...                 'ResultMeasure/MeasureUnitCode': ['#/100ml', 'MPN',],
...                 'ResultMeasureValue': ['1.0', '10',],
...                 })
>>> df
  CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0     Fecal Coliform                       #/100ml                1.0
1     Fecal Coliform                           MPN                 10

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Fecal Coliform')
>>> wq.df
  CharacteristicName ResultMeasure/MeasureUnitCode  ...    Units Fecal_Coliform
0     Fecal Coliform                       #/100ml  ...  #/100ml            1.0
1     Fecal Coliform                           MPN  ...      MPN           10.0

[2 rows x 5 columns]
>>> from harmonize_wq import domains
>>> wq.replace_unit_by_dict(domains.UNITS_REPLACE['Fecal_Coliform'])
>>> wq.df
  CharacteristicName ResultMeasure/MeasureUnitCode  ...        Units Fecal_Coliform
0     Fecal Coliform                       #/100ml  ...  CFU/(100ml)            1.0
1     Fecal Coliform                           MPN  ...  MPN/(100ml)           10.0

[2 rows x 5 columns]
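
A dictionary-driven replacement restricted by a mask can be sketched as below. The function name is hypothetical, and the mapping is an assumption chosen to reproduce the output above; the actual contents of domains.UNITS_REPLACE are not shown in this document.

```python
def replace_units(units, val_dict, mask=None):
    """Replace each val_dict key found in a unit string with its value."""
    if mask is None:
        mask = [True] * len(units)  # no mask: apply to every row
    replaced = []
    for unit, selected in zip(units, mask):
        if selected:
            for old, new in val_dict.items():
                unit = unit.replace(old, new)
        replaced.append(unit)
    return replaced


# Hypothetical mapping, for illustration only.
val_dict = {"#/100ml": "CFU/(100ml)", "MPN": "MPN/(100ml)"}
units = ["#/100ml", "MPN"]
all_rows = replace_units(units, val_dict)
first_row_only = replace_units(units, val_dict, mask=[True, False])
```

Rows where the mask is False keep their original unit string.
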
replace_unit_str(old, new, mask=None)

Replace ALL instances of old with new in the WQCharData.col.unit_out column.

Parameters:
  • old (str) – Sub-string to find and replace.

  • new (str) – Sub-string to replace old sub-string.

  • mask (pandas.Series, optional) – Conditional mask to limit rows. The default, None, uses the c_mask attribute.

Examples

Build pandas DataFrame to use as input:

>>> from pandas import DataFrame
>>> df = DataFrame(
...     {
...       "CharacteristicName": ["Temperature, water", "Temperature, water",],
...       "ResultMeasure/MeasureUnitCode": ["deg C", "deg F",],
...       "ResultMeasureValue": ["31", "87",],
...     }
... )
>>> df
   CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue
0  Temperature, water                         deg C                 31
1  Temperature, water                         deg F                 87

Build WQ Characteristic Data class from pandas DataFrame:

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Temperature, water')
>>> wq.df[['ResultMeasure/MeasureUnitCode', 'Units', 'Temperature']]
  ResultMeasure/MeasureUnitCode  Units  Temperature
0                         deg C  deg C           31
1                         deg F  deg F           87
>>> wq.replace_unit_str(' ', '')
>>> wq.df[['ResultMeasure/MeasureUnitCode', 'Units', 'Temperature']]
  ResultMeasure/MeasureUnitCode Units  Temperature
0                         deg C  degC           31
1                         deg F  degF           87
update_units(units_out)

Update the class units attribute, which defines the units everything will be converted into.

This just updates the attribute, it does not perform the conversion.

Parameters:

units_out (str) – Units to convert results into.

Return type:

None.

Examples

Build WQ Characteristic Data class (df is assumed to be a pandas DataFrame with ‘Phosphorus’ results, built as in the examples above):

>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.units
'mg/l'
>>> wq.update_units('mg/kg')
>>> wq.units
'mg/kg'
update_ureg()

Update class unit registry to define units based on out_col.

harmonize_wq.wq_data.units_dimension(series_in, units, ureg=None)

List unique units not in desired units dimension.

Parameters:
  • series_in (pandas.Series) – Series of units.

  • units (str) – Desired units.

  • ureg (pint.UnitRegistry, optional) – Unit Registry Object with any custom units defined. The default is None.

Returns:

dim_list – List of units with mismatched dimensions.

Return type:

list

Examples

Build series to use as input:

>>> from pandas import Series
>>> unit_series = Series(['mg/l', 'mg/ml', 'g/kg'])
>>> unit_series
0     mg/l
1    mg/ml
2     g/kg
dtype: object

Get list of unique units not in desired units dimension ‘mg/l’:

>>> from harmonize_wq import wq_data
>>> wq_data.units_dimension(unit_series, units='mg/l')
['g/kg']
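
The check can be pictured with a toy lookup table. This is a sketch only: a unit registry such as pint derives the dimension from the unit definition, whereas the DIMENSIONS table and the function name here are hypothetical.

```python
# Hypothetical lookup of unit string -> dimension, for illustration only.
DIMENSIONS = {
    "mg/l": "mass/volume",
    "mg/ml": "mass/volume",
    "g/kg": "mass/mass",
}


def mismatched_units(unit_strings, target):
    """List unique units whose dimension differs from the target's."""
    target_dim = DIMENSIONS[target]
    mismatched = []
    for unit in unit_strings:
        if DIMENSIONS[unit] != target_dim and unit not in mismatched:
            mismatched.append(unit)
    return mismatched


dim_list = mismatched_units(["mg/l", "mg/ml", "g/kg", "g/kg"], "mg/l")
```

'mg/ml' is not listed because it differs from 'mg/l' only in scale, while 'g/kg' differs in dimension and is listed once.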

harmonize_wq.wrangle module

Functions to help re-shape the WQP pandas DataFrame.

harmonize_wq.wrangle.add_activities_to_df(df_in, mask=None)

Add activities to DataFrame.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • mask (pandas.Series, optional) – Row conditional mask to subset the rows to get activities for. The default, None, uses the entire set.

Returns:

df_merged – Table with added info from activities table by location id.

Return type:

pandas.DataFrame

Examples

Build example df_in table from harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:

>>> import pandas
>>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests'
>>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt')
>>> df1.shape
(359505, 35)

Run on the first 1000 results:

>>> df2 = df1[:1000]
>>> from harmonize_wq import wrangle
>>> df_activities = wrangle.add_activities_to_df(df2)
>>> df_activities.shape
(1000, 100)

Look at the columns added:

>>> df_activities.columns[-65:]
Index(['ActivityTypeCode', 'ActivityMediaName', 'ActivityMediaSubdivisionName',
       'ActivityEndDate', 'ActivityEndTime/Time',
       'ActivityEndTime/TimeZoneCode', 'ActivityRelativeDepthName',
       'ActivityDepthHeightMeasure/MeasureValue',
       'ActivityDepthHeightMeasure/MeasureUnitCode',
       'ActivityDepthAltitudeReferencePointText',
       'ActivityTopDepthHeightMeasure/MeasureValue',
       'ActivityTopDepthHeightMeasure/MeasureUnitCode',
       'ActivityBottomDepthHeightMeasure/MeasureValue',
       'ActivityBottomDepthHeightMeasure/MeasureUnitCode', 'ProjectIdentifier',
       'ActivityConductingOrganizationText', 'ActivityCommentText',
       'SampleAquifer', 'HydrologicCondition', 'HydrologicEvent',
       'ActivityLocation/LatitudeMeasure', 'ActivityLocation/LongitudeMeasure',
       'ActivityLocation/SourceMapScaleNumeric',
       'ActivityLocation/HorizontalAccuracyMeasure/MeasureValue',
       'ActivityLocation/HorizontalAccuracyMeasure/MeasureUnitCode',
       'ActivityLocation/HorizontalCollectionMethodName',
       'ActivityLocation/HorizontalCoordinateReferenceSystemDatumName',
       'AssemblageSampledName', 'CollectionDuration/MeasureValue',
       'CollectionDuration/MeasureUnitCode', 'SamplingComponentName',
       'SamplingComponentPlaceInSeriesNumeric',
       'ReachLengthMeasure/MeasureValue', 'ReachLengthMeasure/MeasureUnitCode',
       'ReachWidthMeasure/MeasureValue', 'ReachWidthMeasure/MeasureUnitCode',
       'PassCount', 'NetTypeName', 'NetSurfaceAreaMeasure/MeasureValue',
       'NetSurfaceAreaMeasure/MeasureUnitCode',
       'NetMeshSizeMeasure/MeasureValue', 'NetMeshSizeMeasure/MeasureUnitCode',
       'BoatSpeedMeasure/MeasureValue', 'BoatSpeedMeasure/MeasureUnitCode',
       'CurrentSpeedMeasure/MeasureValue',
       'CurrentSpeedMeasure/MeasureUnitCode', 'ToxicityTestType',
       'SampleCollectionMethod/MethodIdentifier',
       'SampleCollectionMethod/MethodIdentifierContext',
       'SampleCollectionMethod/MethodName',
       'SampleCollectionMethod/MethodQualifierTypeName',
       'SampleCollectionMethod/MethodDescriptionText',
       'SampleCollectionEquipmentName',
       'SampleCollectionMethod/SampleCollectionEquipmentCommentText',
       'SamplePreparationMethod/MethodIdentifier',
       'SamplePreparationMethod/MethodIdentifierContext',
       'SamplePreparationMethod/MethodName',
       'SamplePreparationMethod/MethodQualifierTypeName',
       'SamplePreparationMethod/MethodDescriptionText',
       'SampleContainerTypeName', 'SampleContainerColorName',
       'ChemicalPreservativeUsedName', 'ThermalPreservativeUsedName',
       'SampleTransportStorageDescription', 'ActivityMetricUrl'],
      dtype='object')
harmonize_wq.wrangle.add_detection(df_in, char_val)

Add detection quantitation information for results where available.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • char_val (str) – Specific characteristic name to apply to.

Returns:

df_merged – Table with added info from detection quantitation table columns.

Return type:

pandas.DataFrame

Examples

Build example df_in table from harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:

>>> import pandas
>>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests'
>>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt')
>>> df1.shape
(359505, 35)

Run on a subset of 1000 results to speed it up:

>>> df2 = df1[19000:20000]
>>> df2.shape
(1000, 35)
>>> from harmonize_wq import wrangle
>>> df_detects = wrangle.add_detection(df2, 'Phosphorus')
>>> df_detects.shape
(1001, 38)

Note: the additional row is due to one result being assigned multiple detection results. This is not the case for, e.g., df1[:1000].

Look at the columns added:

>>> df_detects.columns[-3:]
Index(['DetectionQuantitationLimitTypeName',
       'DetectionQuantitationLimitMeasure/MeasureValue',
       'DetectionQuantitationLimitMeasure/MeasureUnitCode'],
      dtype='object')
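
Why the row count can grow is easiest to see with a small one-to-many join. The sketch below uses plain dictionaries; the function name, identifiers, and limit types are hypothetical, and it does not reproduce how add_detection() queries or merges the data.

```python
def left_join(results, limits, key):
    """Left join: one output row per matching limit, or one row if none match."""
    joined = []
    for result in results:
        matches = [lim for lim in limits if lim[key] == result[key]]
        if not matches:
            joined.append(dict(result))  # keep the result without limit columns
        for lim in matches:
            joined.append({**result, **lim})
    return joined


# Hypothetical identifiers and limit types, for illustration only.
results = [{"ResultIdentifier": "r1"}, {"ResultIdentifier": "r2"}]
limits = [
    {"ResultIdentifier": "r1", "LimitType": "type A"},
    {"ResultIdentifier": "r1", "LimitType": "type B"},
]
joined = left_join(results, limits, "ResultIdentifier")
```

Two input results become three rows because result 'r1' has two detection limits, while 'r2' is kept once with no limit information.
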
harmonize_wq.wrangle.as_gdf(shp)

Get a GeoDataFrame for shp if shp is not already a GeoDataFrame.

Parameters:

shp (str) – Filename for something that needs to be a GeoDataFrame.

Returns:

shp – GeoDataFrame for shp if it isn’t already a GeoDataFrame.

Return type:

geopandas.GeoDataFrame

Examples

Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from harmonize_wq tests:

>>> from harmonize_wq import wrangle
>>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
>>> type(wrangle.as_gdf(aoi_url))
<class 'geopandas.geodataframe.GeoDataFrame'>
harmonize_wq.wrangle.clip_stations(stations, aoi)

Clip stations to area of interest (aoi).

Locations and results are queried by extent rather than the exact geometry. Clipping by the exact geometry helps reduce the size of the results.

Notes

aoi is first transformed to CRS of stations.

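Parameters:
  • stations (geopandas.GeoDataFrame) – Station points to clip.

  • aoi – Area of interest to clip to. The example below passes the URL of a GeoJSON file.
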
Returns:

stations_gdf points clipped to the aoi_gdf.

Return type:

pandas.DataFrame

Examples

Build example geopandas GeoDataFrame of locations for stations:

>>> import geopandas
>>> from shapely.geometry import Point
>>> from numpy import nan
>>> d = {'MonitoringLocationIdentifier': ['In', 'Out'],
...      'geometry': [Point (-87.1250, 30.50000),
...                   Point (-87.5000, 30.50000),]}
>>> stations_gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326")
>>> stations_gdf
  MonitoringLocationIdentifier                    geometry
0                           In  POINT (-87.12500 30.50000)
1                          Out  POINT (-87.50000 30.50000)

Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from harmonize_wq tests:

>>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
>>> from harmonize_wq import wrangle
>>> stations_in_aoi = wrangle.clip_stations(stations_gdf, aoi_url)
>>> stations_in_aoi
  MonitoringLocationIdentifier                    geometry
0                           In  POINT (-87.12500 30.50000)
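
The difference between querying by extent and clipping by exact geometry can be sketched with a ray-casting point-in-polygon test. This is not how clip_stations() is implemented; the triangle, the station coordinates, and the function names are hypothetical.

```python
def in_bounding_box(point, polygon):
    """True if point lies within the polygon's rectangular extent."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x, y = point
    return min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)


def in_polygon(point, polygon):
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical triangular area of interest and two station points.
aoi = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
stations = {"In": (1.0, 1.0), "Out": (3.0, 3.0)}

by_extent = [name for name, pt in stations.items() if in_bounding_box(pt, aoi)]
clipped = [name for name, pt in stations.items() if in_polygon(pt, aoi)]
```

Both stations fall inside the rectangular extent, but only 'In' lies inside the triangle itself, so clipping by geometry returns fewer stations.
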
harmonize_wq.wrangle.collapse_results(df_in, cols=None)

Group rows/results that seem to be from the same sample.

Default columns are organization, activity, location, and datetime.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • cols (list, optional) – Columns to consider. The default is None.

Returns:

df_indexed – Updated DataFrame.

Return type:

pandas.DataFrame

Examples

See any of the ‘Simple’ notebooks found in demos for examples of how this function is used to combine rows with the same sample organization, activity, location, and datetime.
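
The grouping idea can be sketched with the standard library. This only shows rows being grouped by a set of key columns; how the library then combines the grouped rows is not shown here, and the column names and values are hypothetical.

```python
from collections import defaultdict


def group_rows(rows, cols):
    """Group row dictionaries that share the same values in cols."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in cols)
        groups[key].append(row)
    return dict(groups)


# Hypothetical rows: two results from one sample and one from another.
rows = [
    {"org": "A", "location": "site1", "datetime": "2004-09-01", "Phosphorus": 1.0},
    {"org": "A", "location": "site1", "datetime": "2004-09-01", "Temperature": 31.0},
    {"org": "A", "location": "site2", "datetime": "2004-09-01", "Phosphorus": 2.0},
]
groups = group_rows(rows, ["org", "location", "datetime"])
```

The first two rows share all key values and end up in the same group, which is the sense in which they seem to be from the same sample.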

harmonize_wq.wrangle.get_activities_by_loc(characteristic_names, locations)

Retrieve activities in batches (segmented what_activities).

Warning: this is not fully implemented and may be removed. Retrieves in batches using dataretrieval.what_activities().

Parameters:
  • characteristic_names (list) – List of characteristic names to retrieve activities for.

  • locations (list) – List of location IDs to retrieve activities for.

Returns:

activities – Combined activities for locations.

Return type:

pandas.DataFrame

Examples

See wrangle.add_activities_to_df()

harmonize_wq.wrangle.get_bounding_box(shp, idx=None)

Get bounding box for spatial file (shp).

Parameters:
  • shp (spatial file) – Any geometry that is readable by geopandas.

  • idx (int, optional) – Index for geometry to get bounding box for. The default is None to return the total extent bounding box.

Return type:

Coordinates for the bounding box as a single string, separated by ‘,’.

Examples

Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from harmonize_wq tests:

>>> from harmonize_wq import wrangle
>>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
>>> wrangle.get_bounding_box(aoi_url)
'-87.72443263367131,30.27180869902194,-86.58972642899643,30.654976858733534'
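
The returned string has the form 'minx,miny,maxx,maxy'. A minimal sketch of building such a string from vertex coordinates follows; the function name and the two polygons are hypothetical, and reading the file with geopandas is left out.

```python
def bounding_box_string(geometries):
    """Return 'minx,miny,maxx,maxy' for the total extent of all geometries."""
    xs = [x for geom in geometries for x, _ in geom]
    ys = [y for geom in geometries for _, y in geom]
    bounds = (min(xs), min(ys), max(xs), max(ys))
    return ",".join(str(value) for value in bounds)


# Hypothetical polygons given as lists of (longitude, latitude) vertices.
polygons = [
    [(-87.5, 30.3), (-87.0, 30.3), (-87.0, 30.6)],
    [(-86.9, 30.4), (-86.6, 30.4), (-86.6, 30.65)],
]
bbox = bounding_box_string(polygons)           # total extent of both polygons
first_only = bounding_box_string(polygons[:1])  # extent of one geometry, like idx=0
```
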
harmonize_wq.wrangle.get_detection_by_loc(loc_series, result_id_series, char_val=None)

Get detection quantitation by location and characteristic (optional).

Retrieves detection quantitation results by location and characteristic name (optional). ResultIdentifier cannot be used to search. Instead, the location ID from loc_series is used, and results are then limited to the ResultIdentifiers in result_id_series.

Notes

There can be multiple Result Detection Quantitation limits per result. A result may have a ResultIdentifier without any corresponding data in the Detection Quantitation limits table (NaN in the returned table).

Parameters:
  • loc_series (pandas.Series) – Series of location IDs to retrieve detection limits for.

  • result_id_series (pandas.Series) – Series of result IDs to limit retrieved data.

  • char_val (str, optional) – Specific characteristic name to retrieve detection limits for. The default, None, uses all ‘CharacteristicName’ values returned.

Returns:

df_out – Detection Quantitation limits table corresponding to input arguments.

Return type:

pandas.DataFrame

harmonize_wq.wrangle.merge_tables(df1, df2, df2_cols='all', merge_cols='activity')

Merge df1 and df2.

Merge tables (df1 and df2), adding df2_cols to df1 where merge_cols match.

Parameters:
  • df1 (pandas.DataFrame) – DataFrame that will be updated.

  • df2 (pandas.DataFrame) – DataFrame with new columns (df2_cols) that will be added to df1.

  • df2_cols (str, optional) – Columns in df2 to add to df1. The default is ‘all’, for all columns not already in df1.

  • merge_cols (str, optional) – Columns in both DataFrames to use in join. The default is ‘activity’, for a subset of columns in the activity df2.

Returns:

merged_results – Updated copy of df1.

Return type:

pandas.DataFrame

Examples

Build example table from harmonize_wq tests to use in place of Water Quality Portal query responses:

>>> import pandas
>>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests'
>>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt')
>>> df1.shape
(359505, 35)
>>> df2 = pandas.read_csv(tests_url + '/data/wqp_activities.txt')
>>> df2.shape
(353911, 40)
>>> from harmonize_wq import wrangle
>>> merged = wrangle.merge_tables(df1, df2)
>>> merged.shape
(359505, 67)
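
The effect of the default df2_cols='all' (only columns not already in df1 are added) can be sketched with dictionaries. The function name, column values, and the single merge column used here are assumptions, not the library implementation.

```python
def merge_new_columns(rows1, rows2, merge_cols):
    """Add columns from rows2 that rows1 lacks, where merge_cols match."""
    lookup = {tuple(row[col] for col in merge_cols): row for row in rows2}
    merged = []
    for row in rows1:
        match = lookup.get(tuple(row[col] for col in merge_cols), {})
        # Keep existing values; only bring in columns missing from this row.
        new_cols = {k: v for k, v in match.items() if k not in row}
        merged.append({**row, **new_cols})
    return merged


# Hypothetical results and activities rows, for illustration only.
results = [
    {"ActivityIdentifier": "a1", "ProviderName": "NWIS", "ResultMeasureValue": "1.0"},
    {"ActivityIdentifier": "a2", "ProviderName": "NWIS", "ResultMeasureValue": "10"},
]
activities = [
    {"ActivityIdentifier": "a1", "ProviderName": "other", "ActivityTypeCode": "Sample"},
]
merged = merge_new_columns(results, activities, ["ActivityIdentifier"])
```

The number of rows is unchanged, and existing columns such as ProviderName keep their values from the first table.
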
harmonize_wq.wrangle.split_col(df_in, result_col='QA_flag', col_prefix='QA')

Move each row value from a column to a characteristic specific column.

Values are moved from the result_col in df_in to a new column where the column name is col_prefix + characteristic.

Parameters:
  • df_in (pandas.DataFrame) – DataFrame that will be updated.

  • result_col (str, optional) – Column name with results to split. The default is ‘QA_flag’.

  • col_prefix (str, optional) – Prefix to be added to new result column names. The default is ‘QA’.

Returns:

df – Updated DataFrame.

Return type:

pandas.DataFrame

Examples

See any of the ‘Simple’ notebooks found in demos for examples of how this function is used to split the QA column into multiple characteristic specific QA columns.
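
As a rough sketch of the idea, each row's value is moved to a column named after its characteristic. The function name and flags are hypothetical, and joining the prefix and characteristic with an underscore is an assumption; the library's exact naming may differ.

```python
def split_column(rows, result_col="QA_flag", col_prefix="QA"):
    """Move each row's result_col value to a characteristic-specific column."""
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != result_col}
        # Assumption: prefix and characteristic are joined with an underscore.
        new_col = f"{col_prefix}_{row['CharacteristicName']}"
        new_row[new_col] = row[result_col]
        out.append(new_row)
    return out


# Hypothetical QA flags, for illustration only.
rows = [
    {"CharacteristicName": "Phosphorus", "QA_flag": "flag A"},
    {"CharacteristicName": "Temperature", "QA_flag": "flag B"},
]
split_rows = split_column(rows)
```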

harmonize_wq.wrangle.split_table(df_in)

Split DataFrame columns axis into main and characteristic based.

Splits pandas.DataFrame in two, one with main results columns and one with Characteristic based metadata.

Notes

Runs clean.datetime() and clean.harmonize_depth() if expected columns (‘Activity_datetime’ and ‘Depth’) are missing.

Parameters:

df_in (pandas.DataFrame) – DataFrame that will be used to generate results.

Returns:

  • main_df (pandas.DataFrame) – DataFrame with main results.

  • chars_df (pandas.DataFrame) – DataFrame with Characteristic based metadata.

Examples

See any of the ‘Simple’ notebooks found in demos for examples of how this function is used to divide the table into columns of interest (main_df) and characteristic specific metadata (chars_df).

harmonize_wq.wrangle.to_simple_shape(gdf, out_shp)

Simplify GeoDataFrame for better export to shapefile.

Adopts and adapts ‘Simple’ from NWQMC/pywqp. See domains.stations_rename() for renaming of columns.

Parameters:
  • gdf (geopandas.GeoDataFrame) – The GeoDataFrame to be exported to shapefile.

  • out_shp (str) – Shapefile directory and file name to be written.

Examples

Build example geopandas GeoDataFrame of locations for stations:

>>> import geopandas
>>> from shapely.geometry import Point
>>> from numpy import nan
>>> d = {'MonitoringLocationIdentifier': ['In', 'Out'],
...      'geometry': [Point (-87.1250, 30.50000),
...                   Point (-87.5000, 30.50000),]}
>>> gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326")
>>> gdf
  MonitoringLocationIdentifier                    geometry
0                           In  POINT (-87.12500 30.50000)
1                          Out  POINT (-87.50000 30.50000)

Add datetime column:

>>> gdf['ActivityStartDate'] = ['2004-09-01', '2004-02-18']
>>> gdf['ActivityStartTime/Time'] = ['10:01:00', '15:39:00']
>>> gdf['ActivityStartTime/TimeZoneCode'] = ['EST', 'EST']
>>> from harmonize_wq import clean
>>> gdf = clean.datetime(gdf)
>>> gdf
  MonitoringLocationIdentifier  ...         Activity_datetime
0                           In  ... 2004-09-01 15:01:00+00:00
1                          Out  ... 2004-02-18 20:39:00+00:00

[2 rows x 6 columns]
>>> from harmonize_wq import wrangle
>>> wrangle.to_simple_shape(gdf, 'dataframe.shp')