harmonize_wq package
harmonize_wq.basis module
Functions to process characteristic basis or return basis dictionary.
- harmonize_wq.basis.unit_basis_dict
Characteristic specific basis dictionary to define basis from units.
Notes
Dictionary with logic for determining basis from units string and standard pint units to replace those with. The structure is {Basis: {standard units: [unit strings with basis]}}. The out_col is often derived from WQCharData.char_val. The desired basis can be used as a key to subset result.
Examples
Get dictionary for Phosphorus and subset for ‘as P’:
>>> from harmonize_wq import basis
>>> basis.unit_basis_dict['Phosphorus']['as P']
{'mg/l': ['mg/l as P', 'mg/l P'], 'mg/kg': ['mg/kg as P', 'mg/kg P']}
- Type:
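As a rough illustration of how a dictionary with this structure might be applied, the sketch below splits a unit string into its standard unit and basis. The helper `split_unit_basis` is hypothetical and not part of harmonize_wq; the dictionary literal copies the Phosphorus ‘as P’ entry shown above.

```python
# Hypothetical helper, not part of harmonize_wq: look up a unit string in a
# {basis: {standard_unit: [unit strings with basis]}} dictionary.
phosphorus_basis = {
    'as P': {'mg/l': ['mg/l as P', 'mg/l P'],
             'mg/kg': ['mg/kg as P', 'mg/kg P']},
}


def split_unit_basis(unit_str, basis_dict):
    """Return (standard_unit, basis); basis is None if nothing matches."""
    for basis, unit_map in basis_dict.items():
        for standard_unit, old_units in unit_map.items():
            if unit_str in old_units:
                return standard_unit, basis
    return unit_str, None


print(split_unit_basis('mg/l as P', phosphorus_basis))  # ('mg/l', 'as P')
print(split_unit_basis('mg/l', phosphorus_basis))       # ('mg/l', None)
```

Unit strings that carry no basis pass through unchanged, which is why the second call returns `None` for the basis.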
- harmonize_wq.basis.basis_conversion
Get dictionary of conversion factors to convert basis/speciation. For example, this is used to convert ‘as PO4’ to ‘as P’. Dictionary structure {basis: conversion factor}.
See also
convert.moles_to_mass()
Best Practices for Submitting Nutrient Data to the Water Quality eXchange
- Type:
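To make the idea of a basis conversion factor concrete, the sketch below derives a factor for ‘as PO4’ to ‘as P’ from atomic weights. The atomic weights and the helper `convert_basis` are assumptions for illustration; the factors stored in harmonize_wq may be rounded differently.

```python
# Standard atomic weights (g/mol); the values actually used by harmonize_wq
# are an assumption here and may be rounded differently.
ATOMIC_WEIGHT = {'P': 30.973762, 'O': 15.999}

# Molar mass of phosphate (PO4): one P atom plus four O atoms.
po4_mass = ATOMIC_WEIGHT['P'] + 4 * ATOMIC_WEIGHT['O']

# Mass fraction of P in PO4, i.e. the factor that turns 'as PO4' into 'as P'.
po4_to_p = ATOMIC_WEIGHT['P'] / po4_mass


def convert_basis(value, factor):
    """Scale a concentration by a basis conversion factor (hypothetical)."""
    return value * factor


print(round(po4_to_p, 4))  # 0.3261
print(convert_basis(1.0, po4_to_p))
```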
- harmonize_wq.basis.stp_dict
Get standard temperature and pressure to define basis from units. Dictionary structure {‘standard temp’ : {‘units’: [values to replace]}}.
Notes
This needs to be updated to include pressure or needs to be renamed.
- Type:
- harmonize_wq.basis.basis_from_method_spec(df_in)
Copy speciation from MethodSpecificationName to new ‘Speciation’ column.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
- Returns:
df – Updated copy of df_in.
- Return type:
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'MethodSpecificationName': ['as P', nan],
...                 'ProviderName': ['NWIS', 'NWIS',],
...                 })
>>> df
  CharacteristicName MethodSpecificationName ProviderName
0         Phosphorus                    as P         NWIS
1         Phosphorus                     NaN         NWIS

>>> from harmonize_wq import basis
>>> basis.basis_from_method_spec(df)
  CharacteristicName MethodSpecificationName ProviderName Speciation
0         Phosphorus                    as P         NWIS       as P
1         Phosphorus                     NaN         NWIS        NaN
- harmonize_wq.basis.basis_from_unit(df_in, basis_dict, unit_col='Units', basis_col='Speciation')
Move basis from units to basis column in pandas.DataFrame.
Move basis information from units in unit_col column to basis in basis_col column based on basis_dict. If basis_col does not exist in df_in it will be created. The unit_col column is updated in place. To maintain data integrity unit_col should not be the original ‘ResultMeasure/MeasureUnitCode’ column.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
basis_dict (dict) – Dictionary with structure {basis:{new_unit:[old_units]}}.
unit_col (str, optional) – String for the units column name in df_in to use. The default is ‘Units’.
basis_col (str, optional) – String for the basis column name in df_in to use. The default is ‘Speciation’.
- Returns:
df – Updated copy of df_in.
- Return type:
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',],
...                 'ResultMeasure/MeasureUnitCode': ['mg/l as P', 'mg/kg as P'],
...                 'Units': ['mg/l as P', 'mg/kg as P'],
...                 })
>>> df
  CharacteristicName ResultMeasure/MeasureUnitCode       Units
0         Phosphorus                     mg/l as P   mg/l as P
1         Phosphorus                    mg/kg as P  mg/kg as P

>>> from harmonize_wq import basis
>>> basis_dict = basis.unit_basis_dict['Phosphorus']
>>> unit_col = 'Units'
>>> basis.basis_from_unit(df, basis_dict, unit_col)
  CharacteristicName ResultMeasure/MeasureUnitCode  Units Speciation
0         Phosphorus                     mg/l as P   mg/l       as P
1         Phosphorus                    mg/kg as P  mg/kg       as P
If an existing basis_col value is different, a warning is issued when it is updated and a QA_flag is assigned:
>>> from numpy import nan
>>> df['Speciation'] = [nan, 'as PO4']
>>> df_speciation_change = basis.basis_from_unit(df, basis_dict, unit_col)
... UserWarning: Mismatched Speciation: updated from as PO4 to as P (units)
>>> df_speciation_change[['Speciation', 'QA_flag']]
  Speciation                                          QA_flag
0       as P                                              NaN
1       as P  Speciation: updated from as PO4 to as P (units)
- harmonize_wq.basis.set_basis(df_in, mask, basis, basis_col='Speciation')
Update or create basis_col with basis as value.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
mask (pandas.Series) – Row conditional mask to limit rows (e.g. to specific unit/speciation).
basis (str) – The string to use for basis.
basis_col (str, optional) – The new or existing column for basis string. The default is ‘Speciation’.
- Returns:
df_out – Updated copy of df_in.
- Return type:
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus',
...                                        'Phosphorus',
...                                        'Salinity'],
...                 'MethodSpecificationName': ['as P', 'as PO4', ''],
...                 })
>>> df
  CharacteristicName MethodSpecificationName
0         Phosphorus                    as P
1         Phosphorus                  as PO4
2           Salinity
Build mask for example:
>>> mask = df['CharacteristicName']=='Phosphorus'
>>> from harmonize_wq import basis
>>> basis.set_basis(df, mask, basis='as P')
  CharacteristicName MethodSpecificationName Speciation
0         Phosphorus                    as P       as P
1         Phosphorus                  as PO4       as P
2           Salinity                                NaN
- harmonize_wq.basis.update_result_basis(df_in, basis_col, unit_col)
Move basis from unit_col column to basis_col column.
This is usually used in place of basis_from_unit when the basis_col is not ‘ResultMeasure/MeasureUnitCode’ (i.e., not speciation).
Notes
Currently overwrites the original basis_col values rather than create many new empty columns. The original values are noted in the QA_flag.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
basis_col (str) – Column in df_in with result basis to update. Expected values are ‘ResultTemperatureBasisText’.
unit_col (str) – Column in df_in with units that may contain basis.
- Returns:
df_out – Updated copy of df_in.
- Return type:
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'CharacteristicName': ['Salinity', 'Salinity',],
...                 'ResultTemperatureBasisText': ['25 deg C', nan,],
...                 'Units': ['mg/mL @25C', 'mg/mL @25C'],
...                 })
>>> df
  CharacteristicName ResultTemperatureBasisText       Units
0           Salinity                   25 deg C  mg/mL @25C
1           Salinity                        NaN  mg/mL @25C

>>> from harmonize_wq import basis
>>> df_temp_basis = basis.update_result_basis(df,
...                                           'ResultTemperatureBasisText',
...                                           'Units')
... UserWarning: Mismatched ResultTemperatureBasisText: updated from 25 deg C to @25C (units)
>>> df_temp_basis[['Units']]
   Units
0  mg/mL
1  mg/mL
>>> df_temp_basis[['ResultTemperatureBasisText', 'QA_flag']]
  ResultTemperatureBasisText                                            QA_flag
0                       @25C  ResultTemperatureBasisText: updated from 25 de...
1                       @25C                                                NaN
harmonize_wq.clean module
Functions to clean/correct additional columns in subset/entire dataset.
- harmonize_wq.clean.add_qa_flag(df_in, mask, flag)
Add flag to ‘QA_flag’ column in df_in.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
mask (pandas.Series) – Row conditional mask to limit rows.
flag (str) – Text to populate the new flag with.
- Returns:
df_out – Updated copy of df_in.
- Return type:
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Carbon', 'Phosphorus', 'Carbon',],
...                 'ResultMeasureValue': ['1.0', '0.265', '2.1'],})
>>> df
  CharacteristicName ResultMeasureValue
0             Carbon                1.0
1         Phosphorus              0.265
2             Carbon                2.1
Assign simple flag string and mask to assign flag only to Carbon:
>>> flag = 'words'
>>> mask = df['CharacteristicName']=='Carbon'

>>> from harmonize_wq import clean
>>> clean.add_qa_flag(df, mask, flag)
  CharacteristicName ResultMeasureValue QA_flag
0             Carbon                1.0   words
1         Phosphorus              0.265     NaN
2             Carbon                2.1   words
- harmonize_wq.clean.check_precision(df_in, col, limit=3)
Add QA_flag if value in column has precision lower than limit.
Notes
Be cautious of float type and real vs representable precision.
- Parameters:
df_in (pandas.DataFrame) – DataFrame containing the column to check.
col (str) – Desired column in df_in.
limit (int, optional) – Number of decimal places under which to detect. The default is 3.
- Returns:
df_out – DataFrame with the quality assurance flag for precision.
- Return type:
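The caution about float precision in the Notes above can be illustrated with a small sketch that counts decimal places from the text form of a value. The helpers `decimal_places` and `low_precision` are hypothetical; how check_precision measures precision internally is not shown in this documentation.

```python
from decimal import Decimal


def decimal_places(value):
    """Count digits after the decimal point in the text form of value."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)


def low_precision(values, limit=3):
    """Return True for each value with fewer than limit decimal places."""
    return [decimal_places(v) < limit for v in values]


print(low_precision(['1.0', '0.265', '2.1']))  # [True, False, True]
# A float keeps only its shortest repr, so trailing zeros are lost:
print(decimal_places(1.500))                   # 1
```

Values stored as strings keep their reported precision, while the same values stored as floats may not, which is one reason the result can differ by column dtype.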
- harmonize_wq.clean.datetime(df_in)
Format time using dataretrieval and ‘ActivityStart’ columns.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with the expected activity date, time and timezone columns.
- Returns:
df_out – DataFrame with the converted datetime column.
- Return type:
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'ActivityStartDate': ['2004-09-01', '2004-07-01',],
...                 'ActivityStartTime/Time': ['10:01:00', nan,],
...                 'ActivityStartTime/TimeZoneCode': ['EST', nan],
...                 })
>>> df
  ActivityStartDate ActivityStartTime/Time ActivityStartTime/TimeZoneCode
0        2004-09-01               10:01:00                            EST
1        2004-07-01                    NaN                            NaN
>>> from harmonize_wq import clean
>>> clean.datetime(df)
  ActivityStartDate  ...          Activity_datetime
0        2004-09-01  ...  2004-09-01 15:01:00+00:00
1        2004-07-01  ...                        NaT

[2 rows x 4 columns]
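The conversion in the example above (10:01 EST becoming 15:01 UTC) can be reproduced with the standard library alone. This is only a sketch: the offset table and the function `to_utc` are hypothetical, and the library itself relies on dataretrieval for this step.

```python
from datetime import datetime, timedelta, timezone

# Illustrative offsets only; a real lookup would cover many more codes.
TZ_OFFSET_HOURS = {'EST': -5, 'UTC': 0}


def to_utc(date_str, time_str, tz_code):
    """Combine date, time and time zone code; return UTC or None."""
    if time_str is None or tz_code is None:
        return None  # mirrors the NaT row in the example above
    local = datetime.strptime(f'{date_str} {time_str}', '%Y-%m-%d %H:%M:%S')
    tz = timezone(timedelta(hours=TZ_OFFSET_HOURS[tz_code]))
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


print(to_utc('2004-09-01', '10:01:00', 'EST'))  # 2004-09-01 15:01:00+00:00
print(to_utc('2004-07-01', None, None))         # None
```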
- harmonize_wq.clean.df_checks(df_in, columns=None)
Check pandas.DataFrame for columns.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be checked.
columns (list, optional) – List of strings for column names. Default None, uses: ‘ResultMeasure/MeasureUnitCode’, ‘ResultMeasureValue’, ‘CharacteristicName’.
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> df = DataFrame({'CharacteristicName': ['Phosphorus'],})
>>> df
  CharacteristicName
0         Phosphorus
Check for existing column:
>>> from harmonize_wq import clean
>>> clean.df_checks(df, columns=['CharacteristicName'])
If a column is not in the DataFrame, an AssertionError is raised:
>>> clean.df_checks(df, columns=['ResultMeasureValue'])
Traceback (most recent call last):
    ...
AssertionError: ResultMeasureValue not in DataFrame
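The behaviour shown above amounts to an assertion per required column. A minimal stand-alone sketch follows, using a plain list of column names instead of a DataFrame; `check_columns` is a hypothetical name and not the library's implementation.

```python
def check_columns(existing, required):
    """Raise AssertionError naming the first required column not present."""
    for col in required:
        assert col in existing, f'{col} not in DataFrame'


# Passes silently when every required column is present.
check_columns(['CharacteristicName'], ['CharacteristicName'])

try:
    check_columns(['CharacteristicName'], ['ResultMeasureValue'])
except AssertionError as err:
    print(err)  # ResultMeasureValue not in DataFrame
```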
- harmonize_wq.clean.harmonize_depth(df_in, units='meter')
Create ‘Depth’ column with result depth values in consistent units.
New column combines values from the ‘ResultDepthHeightMeasure/MeasureValue’ column with units from the ‘ResultDepthHeightMeasure/MeasureUnitCode’ column.
Notes
Currently unit registry (ureg) updates or errors are not passed back. In the future activity depth columns may be considered if result depth missing.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with the required ‘ResultDepthHeight’ columns.
units (str, optional) – Desired units. The default is ‘meter’.
- Returns:
df_out – DataFrame with new Depth column replacing ‘ResultDepthHeight’ columns.
- Return type:
Examples
Build pandas DataFrame for example:
>>> from pandas import DataFrame
>>> from numpy import nan
>>> df = DataFrame({'ResultDepthHeightMeasure/MeasureValue': ['3.0', nan, 10],
...                 'ResultDepthHeightMeasure/MeasureUnitCode': ['m', nan, 'ft'],
...                 })
>>> df
  ResultDepthHeightMeasure/MeasureValue ResultDepthHeightMeasure/MeasureUnitCode
0                                   3.0                                        m
1                                   NaN                                      NaN
2                                    10                                       ft
Get clean ‘Depth’ column:
>>> from harmonize_wq import clean
>>> clean.harmonize_depth(df)[['ResultDepthHeightMeasure/MeasureValue',
...                            'Depth']]
  ResultDepthHeightMeasure/MeasureValue                     Depth
0                                   3.0                 3.0 meter
1                                   NaN                       NaN
2                                    10  3.0479999999999996 meter
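The arithmetic behind the ‘Depth’ values above is a per-unit scale factor. The sketch below shows it without pint; the factor table and `depth_in_meters` are hypothetical, and results may differ from the library's output in the last floating point digits.

```python
# Conversion factors to meters; only the units from the example are listed.
TO_METER = {'m': 1.0, 'ft': 0.3048}


def depth_in_meters(value, unit):
    """Return depth in meters, or None when value or unit is missing."""
    if value is None or unit is None:
        return None
    return float(value) * TO_METER[unit]


rows = [('3.0', 'm'), (None, None), (10, 'ft')]
print([depth_in_meters(v, u) for v, u in rows])
```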
- harmonize_wq.clean.methods_check(df_in, char_val, methods=None)
Check methods against list of accepted methods.
Notes
This is not fully implemented.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
char_val (str) – Characteristic name.
methods (dict, optional) – Dictionary where key is characteristic column name and value is list of dictionaries each with Source and Method keys. This allows updated methods dictionaries to be used. The default None uses the built-in domains.accepted_methods().
- Returns:
accept – List of values from ‘ResultAnalyticalMethod/MethodIdentifier’ column in methods.
- Return type:
- harmonize_wq.clean.wet_dry_checks(df_in, mask=None)
Fix suspected errors in ‘ActivityMediaName’ column.
Uses the ‘ResultWeightBasisText’ and ‘ResultSampleFractionText’ columns to switch if the media is wet/dry where appropriate.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
mask (pandas.Series) – Row conditional (bool) mask to limit df rows to check/fix. The default is None.
- Returns:
df_out – Updated DataFrame.
- Return type:
- harmonize_wq.clean.wet_dry_drop(df_in, wet_dry='wet', char_val=None)
Restrict to only water or only sediment samples.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
wet_dry (str, optional) – Which values (Water/Sediment) to keep. The default is ‘wet’ (Water).
char_val (str, optional) – Apply to specific characteristic name. The default is None (for all).
- Returns:
df2 – Updated copy of df_in.
- Return type:
harmonize_wq.convert module
Functions to convert from one unit to another, at times using pint decorators.
Contains several unit conversion functions not in pint.
- harmonize_wq.convert.DO_concentration(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)
Convert Dissolved Oxygen (DO) from concentration (mg/l) to saturation (%).
- Parameters:
val (pint.Quantity.build_quantity_class) – The DO value (converted to mg/L).
pressure (pint.Quantity, optional) – The pressure value. The default is 1*ureg(“atm”).
temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).
- Returns:
Dissolved Oxygen (DO) as saturation (dimensionless).
- Return type:
Examples
Build units aware pint Quantity, as string:
>>> input_DO = '578 mg/l'
>>> from harmonize_wq import convert
>>> convert.DO_concentration(input_DO)
6995.603308586222
- harmonize_wq.convert.DO_saturation(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)
Convert Dissolved Oxygen (DO) from saturation (%) to concentration (mg/l).
Defaults assume STP where pressure is 1 atmosphere and temperature 25C.
- Parameters:
val (pint.Quantity.build_quantity_class) – The DO saturation value in dimensionless percent.
pressure (pint.Quantity, optional) – The pressure value. The default is 1*ureg(“atm”).
temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).
- Returns:
DO value in mg/l.
- Return type:
Examples
>>> from harmonize_wq import convert
>>> convert.DO_saturation(70)
<Quantity(5.78363269, 'milligram / liter')>
At 2 atm (10m depth):
>>> convert.DO_saturation(70, ('2 standard_atmosphere'))
11.746159340060716 milligram / liter
- harmonize_wq.convert.FNU_to_NTU(val)
Convert turbidity units from FNU (Formazin Nephelometric Units) to NTU.
- Parameters:
val (float) – The turbidity magnitude (FNU is dimensionless).
- Returns:
NTU – The turbidity magnitude (NTU is dimensionless).
- Return type:
Examples
Convert to NTU:
>>> from harmonize_wq import convert
>>> convert.FNU_to_NTU(8)
10.136
- harmonize_wq.convert.JTU_to_NTU(val)
Convert turbidity units from JTU (Jackson Turbidity Units) to NTU.
Notes
This is based on linear relationship: 1 -> 19, 0.053 -> 1, 0.4 -> 7.5
- Parameters:
val (pint.Quantity) – The turbidity value in JTU (dimensionless).
- Returns:
NTU – The turbidity value in dimensionless NTU.
- Return type:
Examples
JTU is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method):
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Turbidity'):
...     ureg.define(definition)
Build JTU units aware pint Quantity:
>>> turbidity = ureg.Quantity('JTU')
>>> str(turbidity)
'1 Jackson_Turbidity_Units'
>>> type(turbidity)
<class 'pint.Quantity'>
Convert to NTU:
>>> from harmonize_wq import convert
>>> str(convert.JTU_to_NTU(str(turbidity)))
'18.9773 Nephelometric_Turbidity_Units'
>>> type(convert.JTU_to_NTU(str(turbidity)))
<class 'pint.Quantity'>
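One way to turn the three reference pairs in the Notes above into a single multiplier is a least-squares fit through the origin. This is an assumption about how such a factor could be derived, not a description of the library's method; the fitted slope comes out near 18.97, close to but not identical to the 18.9773 implied by the example output.

```python
# Reference pairs (JTU, NTU) quoted in the Notes above.
POINTS = [(0.053, 1.0), (0.4, 7.5), (1.0, 19.0)]


def slope_through_origin(points):
    """Least-squares slope for the model y = slope * x."""
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return sum_xy / sum_xx


factor = slope_through_origin(POINTS)
print(factor)  # fitted NTU-per-JTU multiplier
```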
- harmonize_wq.convert.NTU_to_cm(val)
Convert turbidity in NTU (Nephelometric Turbidity Units) to centimeters.
- Parameters:
val (pint.Quantity) – The turbidity value in NTU.
- Returns:
The turbidity value in centimeters.
- Return type:
Examples
NTU is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method):
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Turbidity'):
...     ureg.define(definition)
Build NTU units aware pint Quantity:
>>> turbidity = ureg.Quantity('NTU')
>>> str(turbidity)
'1 Nephelometric_Turbidity_Units'
>>> type(turbidity)
<class 'pint.Quantity'>
Convert to cm:
>>> from harmonize_wq import convert
>>> str(convert.NTU_to_cm('1 NTU'))
'241.27 centimeter'
>>> type(convert.NTU_to_cm('1 NTU'))
<class 'pint.Quantity'>
- harmonize_wq.convert.PSU_to_density(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)
Convert salinity as Practical Salinity Units (PSU) to density.
Dimensionality changes from dimensionless Practical Salinity Units (PSU) to mass/volume density.
- Parameters:
val (pint.Quantity) – The salinity value in dimensionless PSU.
pressure (pint.Quantity, optional) – The pressure value. The default is 1*ureg(“atm”).
temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).
- Returns:
density – The salinity value in density units (mg/ml).
- Return type:
pint.Quantity.build_quantity_class
Examples
PSU is not a standard pint unit and must be added to a unit registry first. This can be done using the WQCharData.update_ureg method:
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Salinity'):
...     ureg.define(definition)
Build units aware pint Quantity, as string because it is an altered unit registry:
>>> unit = ureg.Quantity('PSU')
>>> unit
<Quantity(1, 'Practical_Salinity_Units')>

>>> type(unit)
<class 'pint.Quantity'>

>>> input_psu = str(8*unit)
>>> input_psu
'8 Practical_Salinity_Units'
Convert to density:
>>> from harmonize_wq import convert
>>> str(convert.PSU_to_density(input_psu))
'997.0540284772519 milligram / milliliter'
- harmonize_wq.convert.SiO2_to_NTU(val)
Convert turbidity units from SiO2 (silicon dioxide) to NTU.
Notes
This is based on a linear relationship: 0.13 -> 1, 1 -> 7.5, 2.5 -> 19
- Parameters:
val (pint.Quantity.build_quantity_class) – The turbidity value in SiO2 units (dimensionless).
- Returns:
NTU – The turbidity value in dimensionless NTU.
- Return type:
pint.Quantity.build_quantity_class
Examples
SiO2 is not a standard pint unit and must be added to a unit registry first (normally done using WQCharData.update_ureg() method):
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Turbidity'):
...     ureg.define(definition)
Build SiO2 units aware pint Quantity:
>>> turbidity = ureg.Quantity('SiO2')
>>> str(turbidity)
'1 SiO2'
>>> type(turbidity)
<class 'pint.Quantity'>
Convert to NTU:
>>> from harmonize_wq import convert
>>> str(convert.SiO2_to_NTU(str(turbidity)))
'7.5701 Nephelometric_Turbidity_Units'
>>> type(convert.SiO2_to_NTU(str(turbidity)))
<class 'pint.Quantity'>
- harmonize_wq.convert.cm_to_NTU(val)
Convert turbidity measured in centimeters to NTU.
- Parameters:
val (pint.Quantity) – The turbidity value in centimeters.
- Returns:
The turbidity value in NTU.
- Return type:
Examples
Build standard pint unit registry:
>>> import pint
>>> ureg = pint.UnitRegistry()
Build cm units aware pint Quantity (already in standard unit registry):
>>> turbidity = ureg.Quantity('cm')
>>> str(turbidity)
'1 centimeter'
>>> type(turbidity)
<class 'pint.Quantity'>
Convert to NTU:
>>> from harmonize_wq import convert
>>> str(convert.cm_to_NTU(str(turbidity)))
'3941.8 Nephelometric_Turbidity_Units'
>>> type(convert.cm_to_NTU(str(turbidity)))
<class 'pint.Quantity'>
- harmonize_wq.convert.conductivity_to_PSU(val, pressure=<Quantity(0, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)
Estimate salinity (PSU) from conductivity.
- Parameters:
val (pint.Quantity.build_quantity_class) – The conductivity value (converted to microsiemens / centimeter).
pressure (pint.Quantity, optional) – The pressure value. The default is 0*ureg(“atm”).
temperature (pint.Quantity, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).
- Returns:
Estimated salinity (PSU).
- Return type:
Notes
Conductivity to salinity conversion using the PSS 1978 method. Inputs: c, numeric conductivity in uS (microsiemens); t, numeric Celsius temperature (defaults to 25); P, numeric optional pressure (defaults to 0).
References
IOC, SCOR and IAPSO, 2010: The international thermodynamic equation of seawater – 2010: Calculation and use of thermodynamic properties. Intergovernmental Oceanographic Commission, Manuals and Guides No. 56, UNESCO (English), 196 pp.
Alan D. Jassby and James E. Cloern (2015). wq: Some tools for exploring water quality monitoring data. R package v0.4.4. See the ec2pss function.
Adapted from R cond2sal_shiny
Examples
PSU (Practical Salinity Units) is not a standard pint unit and must be added to a unit registry first:
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Salinity'):
...     ureg.define(definition)
Build units aware pint Quantity, as string:
>>> input_conductivity = '111.0 uS/cm'
Convert to Practical Salinity Units:
>>> from harmonize_wq import convert
>>> convert.conductivity_to_PSU(input_conductivity)
<Quantity(0.057, 'dimensionless')>
- harmonize_wq.convert.convert_unit_series(quantity_series, unit_series, units, ureg=None, errors='raise')
Convert quantities to consistent units.
Convert list of quantities (quantity_series), each with a specified old unit, to a quantity in units using pint constructor method.
- Parameters:
quantity_series (pandas.Series) – List of quantities. Values should be numeric, must not include NaN.
unit_series (pandas.Series) – List of units for each quantity in quantity_series. Values should be string, must not include NaN.
units (str) – Desired units.
ureg (pint.UnitRegistry, optional) – Unit Registry Object with any custom units defined. The default is None.
errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will return NaN.
- Returns:
Converted values from quantity_series in units with original index.
- Return type:
Examples
Build series to use as input:
>>> from pandas import Series
>>> quantity_series = Series([1, 10])
>>> unit_series = Series(['mg/l', 'mg/ml',])
Convert series to series of pint Quantity objects in ‘mg/l’:
>>> from harmonize_wq import convert
>>> convert.convert_unit_series(quantity_series, unit_series, units = 'mg/l')
0                   1.0 milligram / liter
1    10000.000000000002 milligram / liter
dtype: object
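The three values of errors described in the Parameters can be pictured with a simplified, pint-free sketch. The factor table and `convert_all` are hypothetical and the exception type is an assumption; harmonize_wq delegates the real conversion to pint.

```python
import math

# Hypothetical factors to mg/l; harmonize_wq delegates this to pint.
TO_MG_PER_L = {'mg/l': 1.0, 'mg/ml': 1000.0}


def convert_all(quantities, units, errors='raise'):
    """Convert each quantity to mg/l, handling unknown units per errors."""
    out = []
    for quantity, unit in zip(quantities, units):
        if unit in TO_MG_PER_L:
            out.append(quantity * TO_MG_PER_L[unit])
        elif errors == 'skip':
            out.append(quantity)   # left unconverted
        elif errors == 'ignore':
            out.append(math.nan)   # replaced by NaN
        else:
            raise ValueError(f'cannot convert {unit!r} to mg/l')
    return out


print(convert_all([1, 10], ['mg/l', 'mg/ml']))             # [1.0, 10000.0]
print(convert_all([1, 2], ['mg/l', 'ft'], errors='skip'))  # [1.0, 2]
```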
- harmonize_wq.convert.density_to_PSU(val, pressure=<Quantity(1, 'standard_atmosphere')>, temperature=<Quantity(25, 'degree_Celsius')>)
Convert salinity as density (mass/volume) to Practical Salinity Units.
- Parameters:
val (pint.Quantity.build_quantity_class) – The salinity value in density units.
pressure (pint.Quantity.build_quantity_class, optional) – The pressure value. The default is 1*ureg(“atm”).
temperature (pint.Quantity.build_quantity_class, optional) – The temperature value. The default is ureg.Quantity(25, ureg(“degC”)).
- Returns:
PSU – The salinity value in dimensionless PSU.
- Return type:
pint.Quantity.build_quantity_class
Examples
PSU (Practical Salinity Units) is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method):
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> from harmonize_wq import domains
>>> for definition in domains.registry_adds_list('Salinity'):
...     ureg.define(definition)
Build units aware pint Quantity, as string:
>>> input_density = '1000 milligram / milliliter'
Convert to Practical Salinity Units:
>>> from harmonize_wq import convert
>>> convert.density_to_PSU(input_density)
<Quantity(4.71542857, 'gram / kilogram')>
- harmonize_wq.convert.mass_to_moles(ureg, char_val, Q_)
Convert a mass to moles substance.
- Parameters:
ureg (pint.UnitRegistry) – Unit Registry Object with any custom units defined.
char_val (str) – Characteristic name to use to find corresponding molecular weight.
Q_ (pint.Quantity) – Mass to convert to moles.
- Returns:
Value in moles of substance.
- Return type:
Examples
Build standard pint unit registry:
>>> import pint
>>> ureg = pint.UnitRegistry()
Build pint quantity:
>>> Q_ = 1 * ureg('g')

>>> from harmonize_wq import convert
>>> str(convert.mass_to_moles(ureg, 'Phosphorus', Q_))
'0.03228931223764934 mole'
- harmonize_wq.convert.moles_to_mass(ureg, Q_, basis=None, char_val=None)
Convert moles substance to mass.
Either basis or char_val must have a non-default value.
- Parameters:
ureg (pint.UnitRegistry) – Unit Registry Object with any custom units defined.
Q_ (ureg.Quantity) – Quantity (measure and units).
basis (str, optional) – Speciation (basis) of measure to determine molecular weight. Default is None.
char_val (str, optional) – Characteristic Name to use when converting moles substance to mass. Default is None.
- Returns:
Value in mass (g).
- Return type:
Examples
Build standard pint unit registry:
>>> import pint
>>> ureg = pint.UnitRegistry()
Build quantity:
>>> Q_ = 0.265 * ureg('umol')

>>> from harmonize_wq import convert
>>> str(convert.moles_to_mass(ureg, Q_, basis='as P'))
'8.20705e-06 gram'
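The mass and mole conversions above reduce to multiplying or dividing by a molar mass. In the sketch below, the molar mass of 30.97 g/mol for phosphorus is inferred from the documented outputs rather than read from the library source, and both helper functions are hypothetical.

```python
# Molar mass of phosphorus in g/mol; 30.97 is inferred from the documented
# outputs above, not read from the library source.
P_MOLAR_MASS = 30.97


def moles_to_grams(moles, molar_mass):
    """Mass in grams for an amount of substance in moles."""
    return moles * molar_mass


def grams_to_moles(grams, molar_mass):
    """Amount of substance in moles for a mass in grams."""
    return grams / molar_mass


print(moles_to_grams(0.265e-6, P_MOLAR_MASS))  # about 8.20705e-06 g
print(grams_to_moles(1.0, P_MOLAR_MASS))       # about 0.0322893 mol
```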
harmonize_wq.domains module
Functions to return domain lists with all potential values.
These are mainly for use as filters. Small or frequently utilized domains may be hard-coded. A URL based method can be used to get the most up to date domain list.
- harmonize_wq.domains.accepted_methods
Get accepted methods for each characteristic. Dictionary where key is characteristic column name and value is list of dictionaries each with Source and Method keys.
Notes
Source should be in ‘ResultAnalyticalMethod/MethodIdentifierContext’ column. This is not fully implemented.
- Type:
- harmonize_wq.domains.stations_rename
Get shortened column names for shapefile (.shp) fields.
Dictionary where key = WQP field name and value = short name for .shp.
ESRI places a length restriction on shapefile (.shp) field names. This returns a dictionary with the original water quality portal field name (as key) and shortened column name for writing as .shp. We suggest using the longer original name as the field alias when writing as .shp.
Examples
Although running the function returns the full dictionary of Key:Value pairs, here we show how the current name can be used as a key to get the new name:
>>> domains.stations_rename['OrganizationIdentifier']
'org_ID'
- Type:
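As an illustration of why the shortened names matter, the sketch below renames fields through a mapping and reports any that would still exceed the 10-character limit of dBASE (.dbf) field names used by shapefiles. Only the ‘OrganizationIdentifier’ pair comes from the example above; `shorten_fields` and the sample column list are hypothetical.

```python
# dBASE (.dbf) field names, used by shapefiles, are limited to 10 characters.
MAX_FIELD_LEN = 10

# Only this pair comes from the example above; a full mapping has more keys.
rename = {'OrganizationIdentifier': 'org_ID'}


def shorten_fields(columns, mapping):
    """Rename columns via mapping and report names that are still too long."""
    renamed = [mapping.get(col, col) for col in columns]
    too_long = [col for col in renamed if len(col) > MAX_FIELD_LEN]
    return renamed, too_long


print(shorten_fields(['OrganizationIdentifier', 'geometry'], rename))
# (['org_ID', 'geometry'], [])
```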
- harmonize_wq.domains.xy_datum
Get dictionary of expected horizontal datums, where exhaustive:
{HorizontalCoordinateReferenceSystemDatumName: {Description: str, EPSG: int}}
The structure is {key as expected string: value as {“Description”: string, “EPSG”: integer (4-digit code)}}.
Notes
source WQP: HorizontalCoordinateReferenceSystemDatum_CSV.zip
Anything not in dict will be NaN, and non-integer EPSG will be missing: “OTHER”: {“Description”: ‘Other’, “EPSG”: nan}, “UNKWN”: {“Description”: ‘Unknown’, “EPSG”: nan}
Examples
Running the function returns the full dictionary with {abbreviation: {‘Description’:values, ‘EPSG’:values}}. The abbreviation key can be used to get the EPSG code:
>>> domains.xy_datum['NAD83']
{'Description': 'North American Datum 1983', 'EPSG': 4269}
>>> domains.xy_datum['NAD83']['EPSG']
4269
- Type:
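A lookup that falls back to NaN for datums missing from the dictionary might look like the sketch below. The dictionary literal copies only the entries quoted above, and `epsg_for` is a hypothetical helper, not a harmonize_wq function.

```python
import math

# Subset copied from the entries quoted above.
xy_datum = {
    'NAD83': {'Description': 'North American Datum 1983', 'EPSG': 4269},
    'OTHER': {'Description': 'Other', 'EPSG': math.nan},
    'UNKWN': {'Description': 'Unknown', 'EPSG': math.nan},
}


def epsg_for(datum_name):
    """Return the EPSG code for a datum name, or NaN if it is not listed."""
    return xy_datum.get(datum_name, {}).get('EPSG', math.nan)


print(epsg_for('NAD83'))  # 4269
print(epsg_for('WGS72'))  # nan
```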
- harmonize_wq.domains.char_tbl_TADA(df, char)
Get structured dictionary for TADA.CharacteristicName from TADA df.
- Parameters:
df (pandas.DataFrame) – Table from TADA for specific characteristic.
char (str) – CharacteristicName.
- Returns:
new_char_dict – Returned dictionary follows general structure:
{
    “Target.TADA.CharacteristicName”: {
        “Target.TADA.ResultSampleFractionText”: [
            “Target.TADA.ResultSampleFractionText”
        ]
    }
}
- Return type:
- harmonize_wq.domains.characteristic_cols(category=None)
Get characteristic specific columns list, can subset those by category.
- Parameters:
category (str, optional) – Subset results: ‘Basis’, ‘Bio’, ‘Depth’, ‘QA’, ‘activity’, ‘analysis’, ‘depth’, ‘measure’, ‘sample’. The default is None.
- Returns:
col_list – List of columns.
- Return type:
Examples
Running the function without a category returns the full list of column names. Including a category returns only the columns in that category:
>>> domains.characteristic_cols('QA')
['ResultDetectionConditionText', 'ResultStatusIdentifier', 'PrecisionValue',
 'DataQuality/BiasValue', 'ConfidenceIntervalValue',
 'UpperConfidenceLimitValue', 'LowerConfidenceLimitValue',
 'ResultCommentText', 'ResultSamplingPointName',
 'ResultDetectionQuantitationLimitUrl']
- harmonize_wq.domains.get_domain_dict(table, cols=None)
Get domain values for specified table.
- Parameters:
- Returns:
Dictionary where {cols[0]: cols[1]}
- Return type:
Examples
Return the dictionary for a domain from a WQP table (e.g., ‘ResultSampleFraction’). Only the default keys (‘Name’) are shown because the values (‘Description’) are long:
>>> from harmonize_wq import domains
>>> domains.get_domain_dict('ResultSampleFraction').keys()
dict_keys(['Acid Soluble', 'Bed Sediment', 'Bedload', 'Bioavailable',
           'Comb Available', 'Dissolved', 'Extractable',
           'Extractable, CaCO3-bound', 'Extractable, exchangeable',
           'Extractable, organic-bnd', 'Extractable, other',
           'Extractable, oxide-bound', 'Extractable, residual', 'Field***',
           'Filter/sieve residue', 'Filterable', 'Filtered field and/or lab',
           'Filtered, field', 'Filtered, lab', 'Fixed', 'Free Available',
           'Inorganic', 'Leachable', 'Net (Hot)', 'Non-Filterable (Particle)',
           'Non-settleable', 'Non-volatile', 'None', 'Organic',
           'Pot. Dissolved', 'Semivolatile', 'Settleable', 'Sieved',
           'Strong Acid Diss', 'Supernate', 'Suspended', 'Total',
           'Total Recoverable', 'Total Residual', 'Total Soluble',
           'Unfiltered', 'Unfiltered, field', 'Vapor', 'Volatile',
           'Weak Acid Diss', 'Yield', 'non-linear function'])
- harmonize_wq.domains.harmonize_TADA_dict()
Get structured dictionary from TADA HarmonizationTemplate csv.
Based on target column names and sample fractions.
- Returns:
full_dict –
{‘TADA.CharacteristicName’:
    {Target.TADA.CharacteristicName:
        {Target.TADA.ResultSampleFractionText:
            [Target.TADA.ResultSampleFractionText]}}}
- Return type:
- harmonize_wq.domains.re_case(word, domain_list)
Change instance of word in domain_list to UPPERCASE.
- harmonize_wq.domains.registry_adds_list(out_col)
Get units to add to pint unit registry by out_col column.
Typically out_col refers back to column used for a value from the ‘CharacteristicName’ column.
- Parameters:
out_col (str) – The result column a unit registry is being built for.
- Returns:
List of strings with unit additions in expected format.
- Return type:
Examples
Get the unit definitions to add to a pint unit registry for, e.g., Sediment:
>>> from harmonize_wq import domains
>>> domains.registry_adds_list('Sediment')
['fraction = [] = frac', 'percent = 1e-2 frac', 'parts_per_thousand = 1e-3 = ppth', 'parts_per_million = 1e-6 fraction = ppm']
harmonize_wq.harmonize module
Functions to harmonize data retrieved from EPA’s Water Quality Portal.
- harmonize_wq.harmonize.dissolved_oxygen(wqp)
Standardize ‘Dissolved Oxygen (DO)’ characteristic.
Uses wq_data.WQCharData to check units, check unit dimensionality and perform appropriate unit conversions.
- Parameters:
wqp (wq_data.WQCharData) – WQP Characteristic Info Object to check units, check unit dimensionality and perform appropriate unit conversions.
- Returns:
wqp – WQP Characteristic Info Object with updated attributes.
- Return type:
- harmonize_wq.harmonize.harmonize(df_in, char_val, units_out=None, errors='raise', intermediate_columns=False, report=False)
Harmonize char_val rows based on methods specific to that char_val.
All rows where the value in the ‘CharacteristicName’ column matches char_val will have their results harmonized based on available methods for that char_val.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with the expected columns (change based on char_val).
char_val (str) – Target value in ‘CharacteristicName’ column.
units_out (str, optional) – Desired units to convert results into. The default None, uses the constant domains.OUT_UNITS.
errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will return NaN.
intermediate_columns (Boolean, optional) – Return intermediate columns. Default ‘False’ does not return these.
report (bool, optional) – Print a change summary report. The default is False.
- Returns:
df – Updated copy of df_in.
- Return type:
Examples
Build example df_in table from harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:
>>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' >>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt') >>> df1.shape (359505, 35)
>>> from harmonize_wq import harmonize >>> df_result = harmonize.harmonize(df1, 'Temperature, water') >>> df_result OrganizationIdentifier ... Temperature 0 21FLHILL_WQX ... 29.93 degree_Celsius 1 21FLHILL_WQX ... 17.82 degree_Celsius 2 21FLGW_WQX ... 22.42 degree_Celsius 3 21FLMANA_WQX ... 30.0 degree_Celsius 4 21FLHILL_WQX ... 30.37 degree_Celsius ... ... ... ... 359500 21FLHILL_WQX ... 28.75 degree_Celsius 359501 21FLHILL_WQX ... 23.01 degree_Celsius 359502 21FLTBW_WQX ... 29.97 degree_Celsius 359503 21FLPDEM_WQX ... 32.01 degree_Celsius 359504 21FLSMRC_WQX ... NaN [359505 rows x 37 columns]
List columns that were added:
>>> df_result.columns[-2:] Index(['QA_flag', 'Temperature'], dtype='object')
See also
See any of the ‘Detailed’ notebooks found in demos (https://github.com/USEPA/harmonize-wq/tree/main/demos) for examples of how this function is used to standardize, clean, and wrangle a Water Quality Portal query response, one ‘CharacteristicName’ value at a time.
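The three values accepted by the errors parameter can be pictured with a small stand-alone sketch. It is not the library's implementation: `convert_value` and the `TO_MG_PER_L` lookup are hypothetical, and an unknown unit stands in for an invalid dimension conversion.

```python
import math

# Hypothetical lookup of multiplicative factors to mg/l, for illustration only.
TO_MG_PER_L = {"mg/l": 1.0, "mg/ml": 1000.0, "g/l": 1000.0}


def convert_value(value, unit, factors, errors="raise"):
    """Convert value using factors, handling failures per the errors option."""
    try:
        return value * factors[unit]
    except KeyError:
        if errors == "raise":
            # 'raise': surface the failed conversion as an exception
            raise ValueError(f"cannot convert unit {unit!r}") from None
        if errors == "skip":
            # 'skip': leave the value unconverted
            return value
        if errors == "ignore":
            # 'ignore': the failed conversion becomes NaN
            return math.nan
        raise ValueError(f"unknown errors option: {errors!r}")


print(convert_value(1.0, "mg/ml", TO_MG_PER_L))
print(convert_value(1.0, "deg C", TO_MG_PER_L, errors="skip"))
print(convert_value(1.0, "deg C", TO_MG_PER_L, errors="ignore"))
```

With ‘skip’ the original value is kept as it was, whereas ‘ignore’ discards it, so ‘skip’ is the safer choice when unconverted rows will be reviewed later.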
- harmonize_wq.harmonize.harmonize_all(df_in, errors='raise')
Harmonizes all ‘CharacteristicName’ column values with methods.
All results are standardized to default units. Intermediate columns are not retained. See
domains.out_col_lookup()
for list of values with methods.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with the expected columns (changes based on values in ‘CharacteristicName’ column).
errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will return NaN.
- Returns:
df – Updated copy of df_in.
- Return type:
Examples
Build example df_in table from harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:
>>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' >>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt') >>> df1.shape (359505, 35)
When running the function there may be read outs or warnings, as things are encountered such as unexpected nutrient sample fractions:
>>> from harmonize_wq import harmonize >>> df_result_all = harmonize.harmonize_all(df1) 1 Phosphorus sample fractions not in frac_dict 1 Phosphorus sample fractions not in frac_dict found in expected domains, mapped to "Other_Phosphorus"
>>> df_result_all OrganizationIdentifier ... Temperature 0 21FLHILL_WQX ... 29.93 degree_Celsius 1 21FLHILL_WQX ... 17.82 degree_Celsius 2 21FLGW_WQX ... 22.42 degree_Celsius 3 21FLMANA_WQX ... 30.0 degree_Celsius 4 21FLHILL_WQX ... 30.37 degree_Celsius ... ... ... ... 359500 21FLHILL_WQX ... 28.75 degree_Celsius 359501 21FLHILL_WQX ... 23.01 degree_Celsius 359502 21FLTBW_WQX ... 29.97 degree_Celsius 359503 21FLPDEM_WQX ... 32.01 degree_Celsius 359504 21FLSMRC_WQX ... NaN [359505 rows x 42 columns]
List columns that were added:
>>> sorted(list(df_result_all.columns[-7:])) ... ['Other_Phosphorus', 'Phosphorus', 'QA_flag', 'Speciation', 'TDP_Phosphorus', 'TP_Phosphorus', 'Temperature']
See also
See any of the ‘Simple’ notebooks found in demos (https://github.com/USEPA/harmonize-wq/tree/main/demos) for examples of how this function is used to standardize, clean, and wrangle a Water Quality Portal query response.
- harmonize_wq.harmonize.salinity(wqp)
Standardize ‘Salinity’ characteristic.
Uses
wq_data.WQCharData
to check basis, check units, check unit dimensionality and perform appropriate unit conversions.
Notes
PSU=PSS=ppth and ‘ppt’ is picopint in pint so it is changed to ‘ppth’.
- Parameters:
wqp (wq_data.WQCharData) – WQP Characteristic Info Object.
- Returns:
wqp – WQP Characteristic Info Object with updated attributes.
- Return type:
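The note about ‘ppt’ can be illustrated with a minimal string-level sketch. `SALINITY_ALIASES` and `fix_salinity_unit` are hypothetical names, and treating PSU and PSS as plain aliases of ‘ppth’ is an assumption drawn from the note, not a description of how the library performs the replacement.

```python
# Assumed equivalences taken from the note: PSU = PSS = ppth, and 'ppt' is
# renamed because a unit parser would otherwise read it as pico-pint.
SALINITY_ALIASES = {"ppt": "ppth", "PSU": "ppth", "PSS": "ppth"}


def fix_salinity_unit(unit):
    """Return 'ppth' for known parts-per-thousand spellings, else the input."""
    if unit is None:
        return None
    cleaned = unit.strip()
    return SALINITY_ALIASES.get(cleaned, cleaned)


print([fix_salinity_unit(u) for u in ["ppt", "PSU", "ppth", "mg/l"]])
```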
- harmonize_wq.harmonize.sediment(wqp)
Standardize ‘Sediment’ characteristic.
Uses
wq_data.WQCharData
to check basis, check units, and check unit dimensionality.
- Parameters:
wqp (wq_data.WQCharData) – WQP Characteristic Info Object.
- Returns:
wqp – WQP Characteristic Info Object with updated attributes.
- Return type:
- harmonize_wq.harmonize.turbidity(wqp)
Standardize ‘Turbidity’ characteristic.
Uses
wq_data.WQCharData
to check units, check unit dimensionality and perform appropriate unit conversions.
Notes
See USGS Report Chapter A6, Section 6.7, Turbidity.
See ASTM D7315-17 for equivalent unit definitions:
- ‘NTU’ - 400-680nm (EPA 180.1), range 0.0-40.
- ‘NTRU’ - 400-680nm (2130B), range 0-10,000.
- ‘NTMU’ - 400-680nm.
- ‘FNU’ - 780-900nm (ISO 7027), range 0-1000.
- ‘FNRU’ - 780-900nm (ISO 7027), range 0-10,000.
- ‘FAU’ - 780-900nm, range 20-1000.
Older methods:
- ‘FTU’ - lacks instrumentation specificity.
- ‘SiO2’ (ppm or mg/l) - concentration of calibration standard (=JTU).
- ‘JTU’ - candle instead of formazin standard; near 40 NTU these may be equivalent, but highly variable.
Conversions used:
- cm <-> NTU, see convert.cm_to_NTU() from USU.
Alternative conversions available but not currently used by default:
- convert.FNU_to_NTU() from Gohin (2011) Ocean Sci., 7, 705–732 https://doi.org/10.5194/os-7-705-2011.
- convert.SiO2_to_NTU() linear relation from Otilia et al. 2013.
- convert.JTU_to_NTU() linear relation from Otilia et al. 2013.
Otilia, Rusănescu Carmen, Rusănescu Marin, and Stoica Dorel. MONITORING OF PHYSICAL INDICATORS IN WATER SAMPLES. https://hidraulica.fluidas.ro/2013/nr_2/84_89.pdf.
- Parameters:
wqp (wq_data.WQCharData) – WQP Characteristic Info Object.
- Returns:
wqp – WQP Characteristic Info Object with updated attributes.
- Return type:
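The unit definitions in the notes above can be kept as a small lookup table, which makes it easy to check whether a reading falls inside the range quoted for its unit. This is a sketch only: `TURBIDITY_UNITS` and `in_nominal_range` are hypothetical names, and the table simply transcribes the wavelengths and ranges listed in the notes.

```python
# Wavelength band (nm) and nominal range transcribed from the notes above.
# A range of None means the notes do not state one.
TURBIDITY_UNITS = {
    "NTU": {"wavelength_nm": (400, 680), "range": (0.0, 40.0)},
    "NTRU": {"wavelength_nm": (400, 680), "range": (0.0, 10000.0)},
    "NTMU": {"wavelength_nm": (400, 680), "range": None},
    "FNU": {"wavelength_nm": (780, 900), "range": (0.0, 1000.0)},
    "FNRU": {"wavelength_nm": (780, 900), "range": (0.0, 10000.0)},
    "FAU": {"wavelength_nm": (780, 900), "range": (20.0, 1000.0)},
}


def in_nominal_range(value, unit):
    """Return True/False when a range is listed, None when it is unknown."""
    spec = TURBIDITY_UNITS.get(unit)
    if spec is None or spec["range"] is None:
        return None
    low, high = spec["range"]
    return low <= value <= high


print(in_nominal_range(25.0, "NTU"))
print(in_nominal_range(50.0, "NTU"))
print(in_nominal_range(5.0, "NTMU"))
```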
harmonize_wq.location module
Functions to clean/correct location data.
- harmonize_wq.location.get_harmonized_stations(query, aoi=None)
Query, harmonize and clip stations.
Queries the Water Quality Portal for stations with data matching the query, harmonizes those stations’ location information, and clips it to the area of interest (aoi) if specified.
See www.waterqualitydata.us/webservices_documentation for API reference.
- Parameters:
query (dict) – Water Quality Portal query as dictionary.
aoi (geopandas.GeoDataFrame, optional) – Area of interest to clip stations to. The default None returns all stations in the query extent.
- Returns:
stations_gdf (geopandas.GeoDataFrame) – Harmonized stations.
stations (pandas.DataFrame) – Raw station results from WQP.
site_md (dataretrieval.utils.Metadata) – Custom dataretrieval metadata object pertaining to the WQP query.
Examples
See any of the ‘Simple’ notebooks found in demos (https://github.com/USEPA/harmonize-wq/tree/main/demos) for examples of how this function is used to query and harmonize stations.
- harmonize_wq.location.harmonize_locations(df_in, out_EPSG=4326, intermediate_columns=False, **kwargs)
Create harmonized geopandas GeoDataframe from pandas DataFrame.
Takes a DataFrame with lat/lon in multiple Coordinate Reference Systems (CRS), transforms them to out_EPSG CRS, and converts to geopandas.GeoDataFrame. A ‘QA_flag’ column is added to the result and populated for any row that has location-based problems like limited decimal precision or an unknown input CRS.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with the required columns (see kwargs for expected defaults) to be converted to GeoDataFrame.
out_EPSG (int, optional) – EPSG factory code for desired output Coordinate Reference System datum. The default is 4326, for the WGS84 Datum used by WQP queries.
intermediate_columns (Boolean, optional) – Return intermediate columns. Default ‘False’ does not return these.
**kwargs (optional) – Accepts crs_col, lat_col, and lon_col parameters if non-default:
crs_col (str, optional) – Name of column in DataFrame with the Coordinate Reference System datum. The default is ‘HorizontalCoordinateReferenceSystemDatumName’.
lat_col (str, optional) – Name of column in DataFrame with the latitude coordinate. The default is ‘LatitudeMeasure’.
lon_col (str, optional) – Name of column in DataFrame with the longitude coordinate. The default is ‘LongitudeMeasure’.
- Returns:
gdf – GeoDataFrame of df_in with coordinates in out_EPSG datum.
- Return type:
Examples
Build pandas DataFrame to use in example:
>>> import pandas >>> df_in = pandas.DataFrame( ... { ... "LatitudeMeasure": [27.5950355, 27.52183, 28.0661111], ... "LongitudeMeasure": [-82.0300865, -82.64476, -82.3775], ... "HorizontalCoordinateReferenceSystemDatumName": ... ["NAD83", "WGS84", "NAD27"], ... } ... ) >>> df_in LatitudeMeasure ... HorizontalCoordinateReferenceSystemDatumName 0 27.595036 ... NAD83 1 27.521830 ... WGS84 2 28.066111 ... NAD27 [3 rows x 3 columns]
>>> from harmonize_wq import location >>> location.harmonize_locations(df_in) LatitudeMeasure LongitudeMeasure ... QA_flag geometry 0 27.595036 -82.030086 ... NaN POINT (-82.03009 27.59504) 1 27.521830 -82.644760 ... NaN POINT (-82.64476 27.52183) 2 28.066111 -82.377500 ... NaN POINT (-82.37750 28.06611) [3 rows x 5 columns]
- harmonize_wq.location.infer_CRS(df_in, out_EPSG, out_col='EPSG', bad_crs_val=None, crs_col='HorizontalCoordinateReferenceSystemDatumName')
Replace missing or unrecognized Coordinate Reference System (CRS).
Replaces with desired CRS and notes it was missing in ‘QA_flag’ column.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
out_EPSG (str) – Desired CRS to use.
out_col (str, optional) – Column in df to write out_EPSG to. The default is ‘EPSG’.
bad_crs_val (str, optional) – Bad Coordinate Reference System (CRS) datum name value to replace. The default is None for missing datum.
crs_col (str, optional) – Datum column in df_in. The default is ‘HorizontalCoordinateReferenceSystemDatumName’.
- Returns:
df_out – Updated copy of df_in.
- Return type:
Examples
Build pandas DataFrame to use in example, where crs_col name is ‘Datum’ rather than default ‘HorizontalCoordinateReferenceSystemDatumName’:
>>> import pandas >>> from numpy import nan >>> df_in = pandas.DataFrame({'Datum': ['NAD83', 'WGS84', '', None, nan]}) >>> df_in Datum 0 NAD83 1 WGS84 2 3 None 4 NaN
>>> from harmonize_wq import location >>> location.infer_CRS(df_in, out_EPSG=4326, crs_col='Datum') ... Datum QA_flag EPSG 0 NAD83 NaN NaN 1 WGS84 NaN NaN 2 NaN NaN 3 None Datum: MISSING datum, EPSG:4326 assumed 4326.0 4 NaN Datum: MISSING datum, EPSG:4326 assumed 4326.0
NOTE: missing (NaN) and bad CRS values (bad_crs_val=None) are given an EPSG and noted in the ‘QA_flag’ column.
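The behaviour shown in the example (None and NaN are filled, recognized names and the empty string are left alone) can be sketched without pandas. `infer_missing_crs` is a hypothetical helper covering only the default bad_crs_val=None case; the flag text copies the message shown in the example output.

```python
import math


def infer_missing_crs(datums, out_epsg, crs_col="Datum"):
    """Return (epsg_values, qa_flags) with missing datums set to out_epsg."""
    epsgs, flags = [], []
    for datum in datums:
        is_nan = isinstance(datum, float) and math.isnan(datum)
        if datum is None or is_nan:
            # Missing datum: assume the requested EPSG and record why.
            epsgs.append(out_epsg)
            flags.append(f"{crs_col}: MISSING datum, EPSG:{out_epsg} assumed")
        else:
            # Anything else (including '') is left for later handling.
            epsgs.append(None)
            flags.append(None)
    return epsgs, flags


epsgs, flags = infer_missing_crs(["NAD83", "WGS84", "", None, math.nan], 4326)
print(epsgs)
print(flags[3])
```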
- harmonize_wq.location.transform_vector_of_points(df_in, datum, out_EPSG)
Transform points by vector (sub-sets points by EPSG==datum).
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
datum (int) – Current datum (EPSG code) to transform.
out_EPSG (int) – EPSG factory code for desired output Coordinate Reference System datum.
- Returns:
df – Updated copy of df_in.
- Return type:
harmonize_wq.visualize module
Functions to help visualize data.
- harmonize_wq.visualize.map_counts(df_in, gdf, col=None)
Get GeoDataFrame summarized by count of results for each station.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with subset of results.
gdf (geopandas.GeoDataFrame) – GeoDataFrame with monitoring locations.
col (str, optional) – Column in df_in to aggregate results to in addition to location. The default is None, where results are only aggregated on location.
- Returns:
GeoDataFrame with count of results for each station
- Return type:
Examples
Build example DataFrame of results:
>>> from pandas import DataFrame >>> df_in = DataFrame({'ResultMeasureValue': [5.1, 1.2, 8.7], ... 'MonitoringLocationIdentifier': ['ID1', 'ID2', 'ID1'] ... }) >>> df_in ResultMeasureValue MonitoringLocationIdentifier 0 5.1 ID1 1 1.2 ID2 2 8.7 ID1
Build example GeoDataFrame of monitoring locations:
>>> import geopandas >>> from shapely.geometry import Point >>> from numpy import nan >>> d = {'MonitoringLocationIdentifier': ['ID1', 'ID2'], ... 'QA_flag': [nan, nan], ... 'geometry': [Point(1, 2), Point(2, 1)]} >>> gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326") >>> gdf MonitoringLocationIdentifier QA_flag geometry 0 ID1 NaN POINT (1.00000 2.00000) 1 ID2 NaN POINT (2.00000 1.00000)
Combine these to get an aggregation of results per station:
>>> import harmonize_wq >>> cnt_gdf = harmonize_wq.visualize.map_counts(df_in, gdf) >>> cnt_gdf MonitoringLocationIdentifier cnt geometry QA_flag 0 ID1 2 POINT (1.00000 2.00000) NaN 1 ID2 1 POINT (2.00000 1.00000) NaN
These aggregate results can then be plotted:
>>> cnt_gdf.plot(column='cnt', cmap='Blues', legend=True) <Axes: >
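The aggregation behind the ‘cnt’ column amounts to counting results per station and attaching each station's location. A plain-Python sketch of that idea follows; `count_by_station` is a hypothetical helper, coordinates stand in for geometries, and dropping stations that have no results is an assumption.

```python
from collections import Counter

# Same toy data as the example above: (station id, result value).
results = [("ID1", 5.1), ("ID2", 1.2), ("ID1", 8.7)]
locations = {"ID1": (1.0, 2.0), "ID2": (2.0, 1.0), "ID3": (3.0, 3.0)}


def count_by_station(results, locations):
    """Return one record per station that has results, with its count."""
    counts = Counter(station for station, _ in results)
    return [
        {"MonitoringLocationIdentifier": station, "cnt": counts[station], "xy": xy}
        for station, xy in locations.items()
        if station in counts  # assumption: stations without results are dropped
    ]


for record in count_by_station(results, locations):
    print(record)
```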
- harmonize_wq.visualize.map_measure(df_in, gdf, col)
Get GeoDataFrame summarized by average of results for each station.
geopandas.GeoDataFrame
will have new column ‘mean’ with the average of col values for that location.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with subset of results.
gdf (geopandas.GeoDataFrame) – GeoDataFrame with monitoring locations.
col (str) – Column name in df_in to aggregate results for.
- Returns:
GeoDataFrame with average value of results for each station.
- Return type:
Examples
Build array of pint Quantity for Temperature:
>>> from pint import Quantity >>> u = 'degree_Celsius' >>> temperatures = [Quantity(5.1, u), Quantity(1.2, u), Quantity(8.7, u)]
Build example pandas DataFrame of results:
>>> from pandas import DataFrame >>> df_in = DataFrame({'Temperature': temperatures, ... 'MonitoringLocationIdentifier': ['ID1', 'ID2', 'ID1'] ... }) >>> df_in Temperature MonitoringLocationIdentifier 0 5.1 degree_Celsius ID1 1 1.2 degree_Celsius ID2 2 8.7 degree_Celsius ID1
Build example geopandas GeoDataFrame of monitoring locations:
>>> import geopandas >>> from shapely.geometry import Point >>> from numpy import nan >>> d = {'MonitoringLocationIdentifier': ['ID1', 'ID2'], ... 'QA_flag': [nan, nan], ... 'geometry': [Point(1, 2), Point(2, 1)]} >>> gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326") >>> gdf MonitoringLocationIdentifier QA_flag geometry 0 ID1 NaN POINT (1.00000 2.00000) 1 ID2 NaN POINT (2.00000 1.00000)
Combine these to get an aggregation of results per station:
>>> from harmonize_wq import visualize >>> avg_temp = visualize.map_measure(df_in, gdf, 'Temperature') >>> avg_temp MonitoringLocationIdentifier cnt mean geometry QA_flag 0 ID1 2 6.9 POINT (1.00000 2.00000) NaN 1 ID2 1 1.2 POINT (2.00000 1.00000) NaN
These aggregate results can then be plotted:
>>> avg_temp.plot(column='mean', cmap='Blues', legend=True) <Axes: >
- harmonize_wq.visualize.print_report(results_in, out_col, unit_col_in, threshold=None)
Print a standardized report of changes made.
- Parameters:
results_in (pandas.DataFrame) – DataFrame with subset of results.
out_col (str) – Name of column in results_in with final result.
unit_col_in (str) – Name of column with original units.
threshold (dict, optional) – Dictionary with min and max keys. The default is None.
- Return type:
None.
See also
See any of the ‘Detailed’ notebooks found in demos for examples of how this function is leveraged by the
harmonize.harmonize_generic()
report argument.
- harmonize_wq.visualize.station_summary(df_in, col)
Get summary table for stations.
Summary table as
DataFrame
with rows for each station, count, and column average.
- Parameters:
df_in (pandas.DataFrame) – DataFrame with results to summarize.
col (str) – Column name in df_in to summarize results for.
- Returns:
Table with result count and average summarized by station.
- Return type:
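A summary of this shape (one row per station with a count and an average) can be sketched with the standard library. `summarize_stations` is a hypothetical stand-in operating on a list of dictionaries rather than a DataFrame, and the ‘cnt’ and ‘mean’ keys are borrowed from the map_measure example.

```python
from collections import defaultdict
from statistics import fmean


def summarize_stations(rows, col, station_key="MonitoringLocationIdentifier"):
    """Group rows by station and return count and mean of col per station."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[station_key]].append(row[col])
    return [
        {station_key: station, "cnt": len(values), "mean": fmean(values)}
        for station, values in grouped.items()
    ]


rows = [
    {"MonitoringLocationIdentifier": "ID1", "Temperature": 5.1},
    {"MonitoringLocationIdentifier": "ID2", "Temperature": 1.2},
    {"MonitoringLocationIdentifier": "ID1", "Temperature": 8.7},
]
for record in summarize_stations(rows, "Temperature"):
    print(record)
```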
harmonize_wq.wq_data module
Class for harmonizing data retrieved from EPA’s Water Quality Portal.
- class harmonize_wq.wq_data.WQCharData(df_in, char_val)
Bases:
object
Class for specific characteristic in Water Quality Portal results.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
char_val (str) – Expected value in ‘CharacteristicName’ column.
- df
DataFrame with results for the specific characteristic.
- Type:
- c_mask
Row conditional (bool) mask to limit df rows to only those for the specific characteristic.
- Type:
- col
Standard WQCharData.df column names for unit_in, unit_out, and measure.
- Type:
- ureg
pint unit registry, initially standard unit registry.
- Type:
- units
Units all results in out_col column will be converted into. Default units are returned from
domains.OUT_UNITS()
[out_col].
- Type:
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> from numpy import nan >>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Temperature, water',], ... 'ResultMeasure/MeasureUnitCode': [nan, nan], ... 'ResultMeasureValue': ['1.0', '10.0',], ... }) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Phosphorus NaN 1.0 1 Temperature, water NaN 10.0
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus') >>> wq.df CharacteristicName ResultMeasure/MeasureUnitCode ... Units Phosphorus 0 Phosphorus NaN ... NaN 1.0 1 Temperature, water NaN ... NaN NaN [2 rows x 5 columns]
>>> wq.df.columns Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode', 'ResultMeasureValue', 'Units', 'Phosphorus'], dtype='object')
- apply_conversion(convert_fun, unit, u_mask=None)
Apply special dimension changing conversions.
This uses functions in the convert module and applies them across all cases of the current unit.
- Parameters:
convert_fun (function) – Conversion function to apply.
unit (str) – Current unit.
u_mask (pandas.Series, optional) – Mask to use to identify what is being converted. The default is None, creating a unit mask based on unit.
- Return type:
None.
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame( ... { ... 'CharacteristicName': [ ... 'Dissolved oxygen (DO)', ... 'Dissolved oxygen (DO)', ... ], ... 'ResultMeasure/MeasureUnitCode': ['mg/l', '%'], ... 'ResultMeasureValue': ['1.0', '10.0',], ... } ... ) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Dissolved oxygen (DO) mg/l 1.0 1 Dissolved oxygen (DO) % 10.0
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import convert, wq_data >>> wq = wq_data.WQCharData(df, 'Dissolved oxygen (DO)') >>> wq.apply_conversion(convert.DO_saturation, '%') >>> wq.df[['Units', 'DO']] Units DO 0 mg/l 1.000000 1 milligram / liter 0.008262
- check_basis(basis_col='MethodSpecificationName')
Determine speciation (basis) for measure.
- Parameters:
basis_col (str, optional) – Basis column name. Default is ‘MethodSpecificationName’ which is replaced by ‘Speciation’. Other columns are updated in place.
- Return type:
None.
Examples
Build DataFrame to use as input:
>>> from pandas import DataFrame >>> from numpy import nan >>> df = DataFrame( ... { ... "CharacteristicName": [ ... "Phosphorus", ... "Temperature, water", ... "Phosphorus", ... ], ... "ResultMeasure/MeasureUnitCode": ["mg/l as P", nan, "mg/l",], ... "ResultMeasureValue": ["1.0", "67.0", "10",], ... "MethodSpecificationName": [nan, nan, "as PO4",], ... } ... ) >>> df[['ResultMeasure/MeasureUnitCode', 'MethodSpecificationName']] ResultMeasure/MeasureUnitCode MethodSpecificationName 0 mg/l as P NaN 1 NaN NaN 2 mg/l as PO4
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus') >>> wq.df.columns Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode', 'ResultMeasureValue', 'MethodSpecificationName', 'Units', 'Phosphorus'], dtype='object')
Run check_basis method to determine speciation for phosphorus:
>>> wq.check_basis() >>> wq.df[['MethodSpecificationName', 'Speciation']] MethodSpecificationName Speciation 0 NaN P 1 NaN NaN 2 as PO4 PO4
Note where basis was part of ‘ResultMeasure/MeasureUnitCode’ it has been removed in ‘Units’:
>>> wq.df.iloc[0] CharacteristicName Phosphorus ResultMeasure/MeasureUnitCode mg/l as P ResultMeasureValue 1.0 MethodSpecificationName NaN Units mg/l Phosphorus 1.0 Speciation P Name: 0, dtype: object
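How a basis can be recovered from a unit string is easiest to see with the structure documented for basis.unit_basis_dict, {Basis: {standard units: [unit strings with basis]}}. The sketch below uses the ‘as P’ entry shown in the basis module example; `split_basis` is a hypothetical helper, not the method's actual code.

```python
# Entry copied from the basis.unit_basis_dict['Phosphorus'] example.
PHOSPHORUS_BASIS = {
    "as P": {
        "mg/l": ["mg/l as P", "mg/l P"],
        "mg/kg": ["mg/kg as P", "mg/kg P"],
    },
}


def split_basis(unit, basis_dict):
    """Return (standard_unit, speciation); speciation is None if not found."""
    for basis, replacements in basis_dict.items():
        for standard_unit, variants in replacements.items():
            if unit in variants:
                # Drop the leading 'as ' so 'as P' is reported as 'P'.
                speciation = basis[3:] if basis.startswith("as ") else basis
                return standard_unit, speciation
    return unit, None


print(split_basis("mg/l as P", PHOSPHORUS_BASIS))
print(split_basis("mg/l", PHOSPHORUS_BASIS))
```

This mirrors the first row of the example, where ‘mg/l as P’ ends up as ‘mg/l’ in ‘Units’ with ‘P’ in ‘Speciation’.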
- check_units(flag_col=None)
Check units.
Checks for bad units that are missing (assumes default_unit) or unrecognized as valid by unit registry (ureg). Does not check for units in the correct dimensions, or a mistaken identity (e.g. ‘deg F’ recognized as ‘degree * farad’).
- Parameters:
flag_col (str, optional) – Column to reference in string for ‘QA_flags’. The default None uses WQCharData.col.unit_out attribute.
- Return type:
None.
Examples
Build DataFrame to use as input:
>>> from pandas import DataFrame >>> from numpy import nan >>> df = DataFrame( ... { ... "CharacteristicName": [ ... "Phosphorus", ... "Temperature, water", ... "Phosphorus", ... ], ... "ResultMeasure/MeasureUnitCode": [ ... nan, ... nan, ... "Unknown", ... ], ... "ResultMeasureValue": [ ... "1.0", ... "67.0", ... "10", ... ], ... } ... ) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Phosphorus NaN 1.0 1 Temperature, water NaN 67.0 2 Phosphorus Unknown 10
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus') >>> wq.df.Units 0 NaN 1 NaN 2 Unknown Name: Units, dtype: object
Run check_units method to replace bad or missing units for phosphorus:
>>> wq.check_units() UserWarning: WARNING: 'Unknown' UNDEFINED UNIT for Phosphorus
>>> wq.df[['CharacteristicName', 'Units', 'QA_flag']] CharacteristicName Units QA_flag 0 Phosphorus mg/l ResultMeasure/MeasureUnitCode: MISSING UNITS, ... 1 Temperature, water NaN NaN 2 Phosphorus mg/l ResultMeasure/MeasureUnitCode: 'Unknown' UNDEF...
Note: it didn’t infer units for ‘Temperature, water’ because wq is Phosphorus specific.
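The check described above reduces to two cases: a missing unit, or a unit the registry does not recognize; in both the default unit is substituted and a flag is recorded. A minimal sketch follows, in which `check_units_sketch`, the set of known units, and the flag wording are all hypothetical (the real flag text is truncated in the output above and is not reproduced here).

```python
def check_units_sketch(units, default_unit, known_units):
    """Return (checked_units, flags) with bad units replaced by default_unit."""
    checked, flags = [], []
    for unit in units:
        if unit is None:
            checked.append(default_unit)
            flags.append("missing units, default assumed")  # hypothetical text
        elif unit not in known_units:
            checked.append(default_unit)
            flags.append(f"{unit!r} undefined, default assumed")  # hypothetical text
        else:
            checked.append(unit)
            flags.append(None)
    return checked, flags


# Only the Phosphorus rows are passed in, mirroring the characteristic mask.
checked, flags = check_units_sketch([None, "Unknown", "mg/kg"], "mg/l", {"mg/l", "mg/kg"})
print(checked)
print(flags)
```

As in the method itself, nothing here verifies that a recognized unit has the right dimensions for the characteristic.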
- convert_units(default_unit=None, errors='raise')
Update out_col to convert units.
Update the out_col column of the class pandas.DataFrame by converting values from their old units to default_unit.
- Parameters:
default_unit (str, optional) – Units to convert values to. Default None uses units attribute.
errors (str, optional) – Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will be NaN.
- Return type:
None.
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Temperature, water',], ... 'ResultMeasure/MeasureUnitCode': ['mg/ml', 'deg C'], ... 'ResultMeasureValue': ['1.0', '10.0',], ... }) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Phosphorus mg/ml 1.0 1 Temperature, water deg C 10.0
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.convert_units() >>> wq.df[['ResultMeasureValue', 'Units', 'Phosphorus']] ResultMeasureValue Units Phosphorus 0 1.0 mg/ml 1000.0000000000001 milligram / liter 1 10.0 NaN NaN
- dimension_fixes()
Input/output for dimension handling.
Result dictionary key is old_unit and value is equation to get it into the desired dimension. Result list has substance to include as part of unit.
Notes
These are next processed interactively, one dimension at a time, except for mole conversions which are further split by basis (one at a time).
- Returns:
dimension_dict (dict) – Dictionary with old_unit:new_unit.
mol_list (list) – List of Mole (substance) units.
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',], ... 'ResultMeasure/MeasureUnitCode': ['mg/l', 'mg/kg',], ... 'ResultMeasureValue': ['1.0', '10',], ... }) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Phosphorus mg/l 1.0 1 Phosphorus mg/kg 10
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.dimension_fixes() ({'mg/kg': 'mg/kg * H2O'}, [])
- dimensions_list(m_mask=None)
Get list of unique unit dimensions.
- Parameters:
m_mask (pandas.Series, optional) – Conditional mask to limit rows. The default None uses measure_mask().
- Returns:
List of units with mismatched dimensions.
- Return type:
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',], ... 'ResultMeasure/MeasureUnitCode': ['mg/l', 'mg/kg',], ... 'ResultMeasureValue': ['1.0', '10',], ... }) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Phosphorus mg/l 1.0 1 Phosphorus mg/kg 10
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.dimensions_list() ['mg/kg']
- fraction(frac_dict=None, catch_all=None, suffix=None, fract_col='ResultSampleFractionText')
Create columns for sample fractions using frac_dict to set names.
- Parameters:
frac_dict (dict, optional) – Dictionary where {fraction_name : new_col}. The default None starts with an empty dictionary.
catch_all (str, optional) – Name for new field to map sample fractions not mapped by frac_dict
suffix (str, optional) – String to add to the end of any new column name. The default None, uses out_col attribute.
fract_col (str, optional) – Column name where sample fraction is defined. The default is ‘ResultSampleFractionText’.
- Returns:
frac_dict – frac_dict updated to include any fract_col not already defined.
- Return type:
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame({'CharacteristicName': ['Phosphorus', 'Phosphorus',], ... 'ResultMeasure/MeasureUnitCode': ['mg/l', 'mg/kg',], ... 'ResultMeasureValue': ['1.0', '10',], ... 'ResultSampleFractionText': ['Dissolved', ''], ... }) >>> df CharacteristicName ... ResultSampleFractionText 0 Phosphorus ... Dissolved 1 Phosphorus ... [2 rows x 4 columns]
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus')
Go through required checks and conversions
>>> wq.check_units() >>> dimension_dict, mol_list = wq.dimension_fixes() >>> wq.replace_unit_by_dict(dimension_dict, wq.measure_mask()) >>> wq.moles_convert(mol_list) >>> wq.convert_units() >>> wq.df.columns Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode', 'ResultMeasureValue', 'ResultSampleFractionText', 'Units', 'Phosphorus', 'QA_flag'], dtype='object') >>> wq.df['Phosphorus'] 0 1.0 milligram / liter 1 10.000000000000002 milligram / liter Name: Phosphorus, dtype: object
These results may have different, non-comparable sample fractions. First, split results using a provided frac_dict (as used in harmonize()):
>>> from numpy import nan >>> frac_dict = {'TP_Phosphorus': ['Total'], ... 'TDP_Phosphorus': ['Dissolved'], ... 'Other_Phosphorus': ['', nan],} >>> wq.fraction(frac_dict) >>> wq.df.columns Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode', 'ResultMeasureValue', 'ResultSampleFractionText', 'Units', 'Phosphorus', 'QA_flag', 'TDP_Phosphorus', 'Other_Phosphorus'], dtype='object') >>> wq.df[['TDP_Phosphorus', 'Other_Phosphorus']] TDP_Phosphorus Other_Phosphorus 0 1.0 milligram / liter NaN 1 NaN 10.000000000000002 milligram / liter
Alternatively, the sample fraction lists from TADA can be used; in this case they are added:
>>> wq.fraction('TADA') >>> wq.df.columns Index(['CharacteristicName', 'ResultMeasure/MeasureUnitCode', 'ResultMeasureValue', 'ResultSampleFractionText', 'Units', 'Phosphorus', 'QA_flag', 'TDP_Phosphorus', 'Other_Phosphorus', 'TOTAL PHOSPHORUS_ MIXED FORMS'], dtype='object') >>> wq.df[['TOTAL PHOSPHORUS_ MIXED FORMS', 'Other_Phosphorus']] TOTAL PHOSPHORUS_ MIXED FORMS Other_Phosphorus 0 1.0 milligram / liter NaN 1 NaN 10.000000000000002 milligram / liter
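The splitting step can be pictured as routing each value into the column whose fraction list contains that row's sample fraction, with an optional catch-all for anything unmapped. The sketch below uses plain lists; `split_by_fraction` is a hypothetical helper, and the NaN entry from the example's frac_dict is left out for simplicity.

```python
def split_by_fraction(values, fractions, frac_dict, catch_all=None):
    """Return {new_col: values}, with None where a row belongs elsewhere."""
    # Invert {new_col: [fraction names]} into {fraction name: new_col}.
    lookup = {frac: col for col, fracs in frac_dict.items() for frac in fracs}
    columns = {}
    for i, (value, frac) in enumerate(zip(values, fractions)):
        col = lookup.get(frac, catch_all)
        if col is None:
            continue  # unmapped fraction and no catch-all column requested
        columns.setdefault(col, [None] * len(values))[i] = value
    return columns


frac_dict = {
    "TP_Phosphorus": ["Total"],
    "TDP_Phosphorus": ["Dissolved"],
    "Other_Phosphorus": [""],
}
print(split_by_fraction([1.0, 10.0], ["Dissolved", ""], frac_dict))
```

As in the example, no ‘TP_Phosphorus’ column appears because no row has the ‘Total’ fraction.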
- measure_mask(column=None)
Get mask for characteristic and valid measure.
Mask is characteristic specific (c_mask) and only has valid col measures (Non-NA).
- Parameters:
column (str, optional) – DataFrame column name to use. Default None uses WQCharData.out_col attribute.
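- Returns:
Boolean mask that is True for rows matching the characteristic with a valid (non-NA) measure.
- Return type:
pandas.Series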
Examples
Build DataFrame to use as input:
>>> from pandas import DataFrame >>> from numpy import nan >>> df = DataFrame( ... { ... 'CharacteristicName': [ ... 'Phosphorus', ... 'Temperature, water', ... 'Phosphorus', ... 'Phosphorus', ... ], ... 'ResultMeasure/MeasureUnitCode': ['mg/l as P', nan, 'mg/l', 'mg/l',], ... 'ResultMeasureValue': ['1.0', '67.0', '10', 'None'], ... }) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Phosphorus mg/l as P 1.0 1 Temperature, water NaN 67.0 2 Phosphorus mg/l 10 3 Phosphorus mg/l None
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Phosphorus')
Check measure mask:
>>> wq.measure_mask() 0 True 1 False 2 True 3 False dtype: bool
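The mask combines two conditions: the row belongs to the characteristic, and its measure is a usable number. A list-based sketch of that logic is given below; `measure_mask_sketch` is a hypothetical helper, and treating any value that cannot be parsed as a float as invalid is an assumption consistent with the ‘None’ string in the example.

```python
import math


def measure_mask_sketch(characteristics, values, char_val):
    """Return a list of bools: characteristic matches and value is numeric."""

    def is_valid(value):
        try:
            return not math.isnan(float(value))
        except (TypeError, ValueError):
            return False  # e.g. the string 'None' or a real None

    return [
        name == char_val and is_valid(value)
        for name, value in zip(characteristics, values)
    ]


names = ["Phosphorus", "Temperature, water", "Phosphorus", "Phosphorus"]
values = ["1.0", "67.0", "10", "None"]
print(measure_mask_sketch(names, values, "Phosphorus"))
```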
- moles_convert(mol_list)
Update out_col with moles converted and reduce unit_col to units.
- Parameters:
mol_list (list) – List of Mole (substance) units.
- Return type:
None.
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> from numpy import nan >>> df = DataFrame({'CharacteristicName': ['Organic carbon', 'Organic carbon',], ... 'ResultMeasure/MeasureUnitCode': ['mg/l', 'umol',], ... 'ResultMeasureValue': ['1.0', '0.265',], ... 'MethodSpecificationName': [nan, nan,], ... }) >>> df[['ResultMeasure/MeasureUnitCode', 'ResultMeasureValue']] ResultMeasure/MeasureUnitCode ResultMeasureValue 0 mg/l 1.0 1 umol 0.265
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Organic carbon') >>> wq.df CharacteristicName ResultMeasure/MeasureUnitCode ... Units Carbon 0 Organic carbon mg/l ... mg/l 1.000 1 Organic carbon umol ... umol 0.265 [2 rows x 6 columns]
Run required checks:
>>> wq.check_basis() >>> wq.check_units()
Assemble dimensions dict and moles list:
>>> dimension_dict, mol_list = wq.dimension_fixes() >>> dimension_dict {'umol': '0.00018015999999999998 gram / l'} >>> mol_list ['0.00018015999999999998 gram / l']
Replace units by dimension_dict:
>>> wq.replace_unit_by_dict(dimension_dict, wq.measure_mask()) >>> wq.df[['Units', 'Carbon']] Units Carbon 0 mg/l 1.000 1 0.00018015999999999998 gram / l 0.265
Convert Carbon measure into whole units:
>>> wq.moles_convert(mol_list) >>> wq.df[['Units', 'Carbon']] Units Carbon 0 mg/l 1.000000 1 gram / liter 0.000048
This allows final conversion without dimensionality issues:
>>> wq.convert_units() >>> wq.df['Carbon'] 0 1.0 milligram / liter 1 0.0477424 milligram / liter Name: Carbon, dtype: object
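The numbers in the last two steps can be checked by hand. The sketch below is plain Python arithmetic rather than the library's pint-based code. It takes the factor string from the dimension_dict shown above, scales the 0.265 umol result by it, and converts grams to milligrams. The helper name `apply_factor` is hypothetical.

```python
def apply_factor(value, replacement):
    """Split a 'factor unit' string and scale value by the factor."""
    factor_text, _, unit = replacement.partition(" ")
    return value * float(factor_text), unit


# Factor string copied from the dimension_dict shown in the example above.
grams_per_liter, unit = apply_factor(0.265, "0.00018015999999999998 gram / l")
milligrams_per_liter = grams_per_liter * 1000  # 1 gram = 1000 milligrams

print(unit)
print(round(grams_per_liter, 6))
print(round(milligrams_per_liter, 7))
```

The rounded values agree with the 0.000048 gram / liter and 0.0477424 milligram / liter shown in the example output.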
- replace_unit_by_dict(val_dict, mask=None)
Do multiple replace_in_col() replacements using val_dict.
Replaces instances of val_dict key with val_dict value.
- Parameters:
val_dict (dict) – Occurrences of key in the unit column are replaced with the value.
mask (pandas.Series, optional) – Conditional mask to limit rows. The default, None, uses the c_mask attribute.
- Return type:
None.
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame({'CharacteristicName': ['Fecal Coliform', 'Fecal Coliform',], ... 'ResultMeasure/MeasureUnitCode': ['#/100ml', 'MPN',], ... 'ResultMeasureValue': ['1.0', '10',], ... }) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Fecal Coliform #/100ml 1.0 1 Fecal Coliform MPN 10
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Fecal Coliform') >>> wq.df CharacteristicName ResultMeasure/MeasureUnitCode ... Units Fecal_Coliform 0 Fecal Coliform #/100ml ... #/100ml 1.0 1 Fecal Coliform MPN ... MPN 10.0 [2 rows x 5 columns]
>>> from harmonize_wq import domains >>> wq.replace_unit_by_dict(domains.UNITS_REPLACE['Fecal_Coliform']) >>> wq.df CharacteristicName ResultMeasure/MeasureUnitCode ... Units Fecal_Coliform 0 Fecal Coliform #/100ml ... CFU/(100ml) 1.0 1 Fecal Coliform MPN ... MPN/(100ml) 10.0 [2 rows x 5 columns]
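As a rough illustration of dictionary-driven replacement limited by a mask, here is a plain-Python sketch. It is not the library's code: the function name `sketch_replace_by_dict` is hypothetical, the mapping is inferred from the example output above rather than copied from domains.UNITS_REPLACE, and treating a missing mask as "all rows" is a simplification (the method itself falls back to the c_mask attribute).

```python
def sketch_replace_by_dict(units, val_dict, mask=None):
    """Replace each key found in a unit string with its value, on masked rows."""
    if mask is None:
        mask = [True] * len(units)  # simplification: no mask means every row
    out = []
    for unit, use_row in zip(units, mask):
        if use_row:
            for old, new in val_dict.items():
                unit = unit.replace(old, new)
        out.append(unit)
    return out


# Mapping inferred from the example output above (an assumption).
mapping = {"#/100ml": "CFU/(100ml)", "MPN": "MPN/(100ml)"}

all_rows = sketch_replace_by_dict(["#/100ml", "MPN"], mapping)
first_row_only = sketch_replace_by_dict(["#/100ml", "MPN"], mapping, [True, False])
print(all_rows)
print(first_row_only)
```

Because the replacements run one key after another, a mapping whose values contain other keys could be rewritten twice, so key order matters in this sketch.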
- replace_unit_str(old, new, mask=None)
Replace ALL instances of old with new in the WQCharData.col.unit_out column.
- Parameters:
old (str) – Sub-string to find and replace.
new (str) – Sub-string to replace old sub-string.
mask (pandas.Series, optional) – Conditional mask to limit rows. The default, None, uses the c_mask attribute.
Examples
Build pandas DataFrame to use as input:
>>> from pandas import DataFrame >>> df = DataFrame( ... { ... "CharacteristicName": ["Temperature, water", "Temperature, water",], ... "ResultMeasure/MeasureUnitCode": ["deg C", "deg F",], ... "ResultMeasureValue": ["31", "87",], ... } ... ) >>> df CharacteristicName ResultMeasure/MeasureUnitCode ResultMeasureValue 0 Temperature, water deg C 31 1 Temperature, water deg F 87
Build WQ Characteristic Data class from pandas DataFrame:
>>> from harmonize_wq import wq_data >>> wq = wq_data.WQCharData(df, 'Temperature, water') >>> wq.df[['ResultMeasure/MeasureUnitCode', 'Units', 'Temperature']] ResultMeasure/MeasureUnitCode Units Temperature 0 deg C deg C 31 1 deg F deg F 87
>>> wq.replace_unit_str(' ', '') >>> wq.df[['ResultMeasure/MeasureUnitCode', 'Units', 'Temperature']] ResultMeasure/MeasureUnitCode Units Temperature 0 deg C degC 31 1 deg F degF 87
- update_units(units_out)
Update the class units attribute, i.e., the units everything is converted into.
This just updates the attribute, it does not perform the conversion.
- Parameters:
units_out (str) – Units to convert results into.
- Return type:
None.
Examples
Build WQ Characteristic Data class (df is a pandas DataFrame with Phosphorus results, such as the one built for measure_mask() above):
>>> from harmonize_wq import wq_data
>>> wq = wq_data.WQCharData(df, 'Phosphorus')
>>> wq.units
'mg/l'

>>> wq.update_units('mg/kg')
>>> wq.units
'mg/kg'
- update_ureg()
Update class unit registry to define units based on out_col.
- harmonize_wq.wq_data.units_dimension(series_in, units, ureg=None)
List unique units not in desired units dimension.
- Parameters:
series_in (pandas.Series) – Series of units.
units (str) – Desired units.
ureg (pint.UnitRegistry, optional) – Unit Registry Object with any custom units defined. The default is None.
- Returns:
dim_list – List of units with mismatched dimensions.
- Return type:
list
Examples
Build series to use as input:
>>> from pandas import Series >>> unit_series = Series(['mg/l', 'mg/ml', 'g/kg']) >>> unit_series 0 mg/l 1 mg/ml 2 g/kg dtype: object
Get list of unique units not in desired units dimension ‘mg/l’:
>>> from harmonize_wq import wq_data
>>> wq_data.units_dimension(unit_series, units='mg/l')
['g/kg']
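To see why only 'g/kg' is reported, compare dimensions rather than unit strings: 'mg/l' and 'mg/ml' are both mass per volume, while 'g/kg' is mass per mass. The sketch below uses a small hand-written lookup table as a stand-in for a pint unit registry. The table and the function names are hypothetical and cover only the units in this example.

```python
# Hypothetical lookup table standing in for a pint unit registry.
BASE_DIMENSION = {
    "mg": "mass",
    "g": "mass",
    "kg": "mass",
    "l": "volume",
    "ml": "volume",
}


def dimension(unit):
    """Return a tuple of base dimensions, e.g. 'mg/l' -> ('mass', 'volume')."""
    return tuple(BASE_DIMENSION[part] for part in unit.split("/"))


def sketch_units_dimension(series_in, units):
    """List unique units whose dimension differs from that of units."""
    target = dimension(units)
    mismatched = []
    for unit in series_in:
        if dimension(unit) != target and unit not in mismatched:
            mismatched.append(unit)
    return mismatched


result = sketch_units_dimension(["mg/l", "mg/ml", "g/kg"], "mg/l")
print(result)
```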
harmonize_wq.wrangle module
Functions to help re-shape the WQP pandas DataFrame.
- harmonize_wq.wrangle.add_activities_to_df(df_in, mask=None)
Add activities to DataFrame.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
mask (pandas.Series, optional) – Row conditional mask to sub-set rows to get activities for. The default, None, uses the entire set.
- Returns:
df_merged – Table with added info from activities table by location id.
- Return type:
Examples
Build an example df_in table from the harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:
>>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' >>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt') >>> df1.shape (359505, 35)
Run on the first 1000 results:
>>> df2 = df1[:1000]
>>> from harmonize_wq import wrangle >>> df_activities = wrangle.add_activities_to_df(df2) >>> df_activities.shape (1000, 100)
Look at the columns added:
>>> df_activities.columns[-65:] Index(['ActivityTypeCode', 'ActivityMediaName', 'ActivityMediaSubdivisionName', 'ActivityEndDate', 'ActivityEndTime/Time', 'ActivityEndTime/TimeZoneCode', 'ActivityRelativeDepthName', 'ActivityDepthHeightMeasure/MeasureValue', 'ActivityDepthHeightMeasure/MeasureUnitCode', 'ActivityDepthAltitudeReferencePointText', 'ActivityTopDepthHeightMeasure/MeasureValue', 'ActivityTopDepthHeightMeasure/MeasureUnitCode', 'ActivityBottomDepthHeightMeasure/MeasureValue', 'ActivityBottomDepthHeightMeasure/MeasureUnitCode', 'ProjectIdentifier', 'ActivityConductingOrganizationText', 'ActivityCommentText', 'SampleAquifer', 'HydrologicCondition', 'HydrologicEvent', 'ActivityLocation/LatitudeMeasure', 'ActivityLocation/LongitudeMeasure', 'ActivityLocation/SourceMapScaleNumeric', 'ActivityLocation/HorizontalAccuracyMeasure/MeasureValue', 'ActivityLocation/HorizontalAccuracyMeasure/MeasureUnitCode', 'ActivityLocation/HorizontalCollectionMethodName', 'ActivityLocation/HorizontalCoordinateReferenceSystemDatumName', 'AssemblageSampledName', 'CollectionDuration/MeasureValue', 'CollectionDuration/MeasureUnitCode', 'SamplingComponentName', 'SamplingComponentPlaceInSeriesNumeric', 'ReachLengthMeasure/MeasureValue', 'ReachLengthMeasure/MeasureUnitCode', 'ReachWidthMeasure/MeasureValue', 'ReachWidthMeasure/MeasureUnitCode', 'PassCount', 'NetTypeName', 'NetSurfaceAreaMeasure/MeasureValue', 'NetSurfaceAreaMeasure/MeasureUnitCode', 'NetMeshSizeMeasure/MeasureValue', 'NetMeshSizeMeasure/MeasureUnitCode', 'BoatSpeedMeasure/MeasureValue', 'BoatSpeedMeasure/MeasureUnitCode', 'CurrentSpeedMeasure/MeasureValue', 'CurrentSpeedMeasure/MeasureUnitCode', 'ToxicityTestType', 'SampleCollectionMethod/MethodIdentifier', 'SampleCollectionMethod/MethodIdentifierContext', 'SampleCollectionMethod/MethodName', 'SampleCollectionMethod/MethodQualifierTypeName', 'SampleCollectionMethod/MethodDescriptionText', 'SampleCollectionEquipmentName', 
'SampleCollectionMethod/SampleCollectionEquipmentCommentText', 'SamplePreparationMethod/MethodIdentifier', 'SamplePreparationMethod/MethodIdentifierContext', 'SamplePreparationMethod/MethodName', 'SamplePreparationMethod/MethodQualifierTypeName', 'SamplePreparationMethod/MethodDescriptionText', 'SampleContainerTypeName', 'SampleContainerColorName', 'ChemicalPreservativeUsedName', 'ThermalPreservativeUsedName', 'SampleTransportStorageDescription', 'ActivityMetricUrl'], dtype='object')
- harmonize_wq.wrangle.add_detection(df_in, char_val)
Add detection quantitation information for results where available.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
char_val (str) – Specific characteristic name to apply to.
- Returns:
df_merged – Table with added info from detection quantitation table columns.
- Return type:
Examples
Build an example df_in table from the harmonize_wq tests to use in place of a Water Quality Portal query response. This table has ‘Temperature, water’ and ‘Phosphorus’ results:
>>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' >>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt') >>> df1.shape (359505, 35)
Run on 1000 results to speed it up:
>>> df2 = df1[19000:20000] >>> df2.shape (1000, 35)
>>> from harmonize_wq import wrangle >>> df_detects = wrangle.add_detection(df2, 'Phosphorus') >>> df_detects.shape (1001, 38)
Note: the additional row (1001 rather than 1000) appears because one result can be assigned multiple detection limits. This is not the case for, e.g., df1[:1000].
Look at the columns added:
>>> df_detects.columns[-3:] Index(['DetectionQuantitationLimitTypeName', 'DetectionQuantitationLimitMeasure/MeasureValue', 'DetectionQuantitationLimitMeasure/MeasureUnitCode'], dtype='object')
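The row count grows because adding detection limits behaves like a one-to-many join: a result with two limits appears twice. The plain-Python sketch below illustrates that behaviour with a left join. It is an illustration under assumptions, not the library's merge logic, and the `value` and `limit_type` fields and their contents are hypothetical.

```python
results = [
    {"ResultIdentifier": "r1", "value": 0.1},
    {"ResultIdentifier": "r2", "value": 0.2},
    {"ResultIdentifier": "r3", "value": 0.3},
]
# Hypothetical detection limit rows: r2 has two limits, r3 has none.
limits = [
    {"ResultIdentifier": "r1", "limit_type": "Method Detection Level"},
    {"ResultIdentifier": "r2", "limit_type": "Method Detection Level"},
    {"ResultIdentifier": "r2", "limit_type": "Lower Quantitation Limit"},
]


def left_join(left, right, key):
    """Keep every left row; repeat it once per matching right row."""
    merged = []
    for row in left:
        matches = [r for r in right if r[key] == row[key]]
        if not matches:
            matches = [{}]  # no match: keep the row, new columns stay missing
        for match in matches:
            merged.append({**row, **match})
    return merged


merged = left_join(results, limits, "ResultIdentifier")
print(len(results), len(merged))
```

Three results become four rows: r2 is repeated once per limit, and r3 is kept without any limit information.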
- harmonize_wq.wrangle.as_gdf(shp)
Get a GeoDataFrame for shp if shp is not already a GeoDataFrame.
- Parameters:
shp (str) – Filename for something that needs to be a GeoDataFrame.
- Returns:
shp – GeoDataFrame for shp if it isn’t already a GeoDataFrame.
- Return type:
geopandas.GeoDataFrame
Examples
Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from harmonize_wq tests:
>>> from harmonize_wq import wrangle
>>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
>>> type(wrangle.as_gdf(aoi_url))
<class 'geopandas.geodataframe.GeoDataFrame'>
- harmonize_wq.wrangle.clip_stations(stations, aoi)
Clip stations to area of interest (aoi).
Locations and results are queried by extent rather than the exact geometry. Clipping by the exact geometry helps reduce the size of the results.
Notes
aoi is first transformed to CRS of stations.
- Parameters:
stations (geopandas.GeoDataFrame) – Points representing the stations.
aoi (geopandas.GeoDataFrame) – Polygon representing the area of interest.
- Returns:
stations_gdf points clipped to the aoi_gdf.
- Return type:
Examples
Build example geopandas GeoDataFrame of locations for stations:
>>> import geopandas >>> from shapely.geometry import Point >>> from numpy import nan >>> d = {'MonitoringLocationIdentifier': ['In', 'Out'], ... 'geometry': [Point (-87.1250, 30.50000), ... Point (-87.5000, 30.50000),]} >>> stations_gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326") >>> stations_gdf MonitoringLocationIdentifier geometry 0 In POINT (-87.12500 30.50000) 1 Out POINT (-87.50000 30.50000)
Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from harmonize_wq tests:
>>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
>>> from harmonize_wq import wrangle >>> stations_in_aoi = wrangle.clip_stations(stations_gdf, aoi_url) >>> stations_in_aoi MonitoringLocationIdentifier geometry 0 In POINT (-87.12500 30.50000)
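The difference between querying by extent and clipping by exact geometry can be shown without geopandas. The sketch below is plain Python with a standard ray-casting point-in-polygon test. It is only an illustration, not how clip_stations is implemented, and the triangle and station coordinates are hypothetical.

```python
def in_extent(point, polygon):
    """True if point lies inside the polygon's bounding box."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x, y = point
    return min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)


def in_polygon(point, polygon):
    """Ray-casting test: count edge crossings of a ray going right from point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical triangular area of interest and two stations.
aoi = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
stations = {"In": (1.0, 1.0), "Out": (3.5, 3.5)}

by_extent = [name for name, pt in stations.items() if in_extent(pt, aoi)]
clipped = [name for name, pt in stations.items() if in_polygon(pt, aoi)]
print(by_extent, clipped)
```

Both stations fall inside the bounding box, but only 'In' lies inside the triangle, which is why clipping reduces the result size.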
- harmonize_wq.wrangle.collapse_results(df_in, cols=None)
Group rows/results that seem to be from the same sample.
Default columns are organization, activity, location, and datetime.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
cols (list, optional) – Columns to consider. The default is None.
- Returns:
df_indexed – Updated DataFrame.
- Return type:
Examples
See any of the ‘Simple’ notebooks found in demos for examples of how this function is used to combine rows with the same sample organization, activity, location, and datetime.
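For readers without the notebooks at hand, the grouping idea can be sketched in plain Python. This is an assumption-laden illustration, not the library's code: the key column names are guesses at the default columns, and "keep the first non-missing value per column" is one possible way to combine rows, chosen here for simplicity.

```python
def sketch_collapse(rows, cols):
    """Group rows by key columns; keep the first non-missing value per column."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        merged = groups.setdefault(key, {})
        for name, value in row.items():
            if merged.get(name) is None:
                merged[name] = value
    return list(groups.values())


# Assumed key columns (organization, activity, location, datetime).
keys = [
    "OrganizationIdentifier",
    "ActivityIdentifier",
    "MonitoringLocationIdentifier",
    "Activity_datetime",
]
sample_a = dict(zip(keys, ["ORG1", "A1", "L1", "2004-09-01 15:01"]))
sample_b = dict(zip(keys, ["ORG1", "A2", "L2", "2004-02-18 20:39"]))
rows = [
    {**sample_a, "Phosphorus": 0.05, "Temperature": None},
    {**sample_a, "Phosphorus": None, "Temperature": 20.5},
    {**sample_b, "Phosphorus": 0.07, "Temperature": None},
]

collapsed = sketch_collapse(rows, keys)
print(len(rows), len(collapsed))
```

The two rows that share all four key values become a single row holding both the Phosphorus and the Temperature result.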
- harmonize_wq.wrangle.get_activities_by_loc(characteristic_names, locations)
Segment batch what_activities.
Warning: this is not fully implemented and may not stay. Retrieves in batch using dataretrieval.what_activities().
- Parameters:
- Returns:
activities – Combined activities for locations.
- Return type:
Examples
See wrangle.add_activities_to_df().
- harmonize_wq.wrangle.get_bounding_box(shp, idx=None)
Get bounding box for spatial file (shp).
- Parameters:
shp (spatial file) – Any geometry that is readable by geopandas.
idx (int, optional) – Index for geometry to get bounding box for. The default is None to return the total extent bounding box.
- Return type:
Coordinates for the bounding box as a single comma-separated string.
Examples
Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from harmonize_wq tests:
>>> from harmonize_wq import wrangle
>>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
>>> wrangle.get_bounding_box(aoi_url)
'-87.72443263367131,30.27180869902194,-86.58972642899643,30.654976858733534'
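The returned string in the example lists the extent in minimum x, minimum y, maximum x, maximum y order. A minimal plain-Python sketch of building such a string from coordinates follows. The function name and the sample points are hypothetical, and the real function reads its geometry through geopandas rather than from a list of tuples.

```python
def sketch_bounding_box(points):
    """Return 'minx,miny,maxx,maxy' for an iterable of (x, y) coordinates."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    bounds = (min(xs), min(ys), max(xs), max(ys))
    return ",".join(str(value) for value in bounds)


# Hypothetical longitude/latitude vertices.
points = [(-87.5, 30.3), (-86.6, 30.65), (-87.1, 30.5)]
bbox = sketch_bounding_box(points)
print(bbox)
```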
- harmonize_wq.wrangle.get_detection_by_loc(loc_series, result_id_series, char_val=None)
Get detection quantitation by location and characteristic (optional).
Retrieves detection quantitation results by location and characteristic name (optional). ResultIdentifier cannot be used to search. Instead, the location id from loc_series is used and the results are then limited to the ResultIdentifiers in result_id_series.
Notes
There can be multiple Result Detection Quantitation limits per result. A result may have a ResultIdentifier without any corresponding data in the Detection Quantitation limits table (NaN in return).
- Parameters:
loc_series (pandas.Series) – Series of location IDs to retrieve detection limits for.
result_id_series (pandas.Series) – Series of result IDs to limit retrieved data.
char_val (str, optional) – Specific characteristic name to retrieve detection limits for. The default, None, uses all ‘CharacteristicName’ values returned.
- Returns:
df_out – Detection Quantitation limits table corresponding to input arguments.
- Return type:
- harmonize_wq.wrangle.merge_tables(df1, df2, df2_cols='all', merge_cols='activity')
Merge df1 and df2.
Merge tables (df1 and df2), adding df2_cols to df1 where merge_cols match.
- Parameters:
df1 (pandas.DataFrame) – DataFrame that will be updated.
df2 (pandas.DataFrame) – DataFrame with new columns (df2_cols) that will be added to df1.
df2_cols (str, optional) – Columns in df2 to add to df1. The default is ‘all’, for all columns not already in df1.
merge_cols (str, optional) – Columns in both DataFrames to use in join. The default is ‘activity’, for a subset of columns in the activity df2.
- Returns:
merged_results – Updated copy of df1.
- Return type:
Examples
Build example table from harmonize_wq tests to use in place of Water Quality Portal query responses:
>>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' >>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt') >>> df1.shape (359505, 35)
>>> df2 = pandas.read_csv(tests_url + '/data/wqp_activities.txt') >>> df2.shape (353911, 40)
>>> from harmonize_wq import wrangle >>> merged = wrangle.merge_tables(df1, df2) >>> merged.shape (359505, 67)
- harmonize_wq.wrangle.split_col(df_in, result_col='QA_flag', col_prefix='QA')
Move each row value from a column to a characteristic specific column.
Values are moved from the result_col in df_in to a new column where the column name is col_prefix + characteristic.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be updated.
result_col (str, optional) – Column name with results to split. The default is ‘QA_flag’.
col_prefix (str, optional) – Prefix to be added to new result column names. The default is ‘QA’.
- Returns:
df – Updated DataFrame.
- Return type:
Examples
See any of the ‘Simple’ notebooks found in demos for examples of how this function is used to split the QA column into multiple characteristic specific QA columns.
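As a rough picture of what the split does, the sketch below moves each row's flag into a column named after that row's characteristic. It is plain Python under stated assumptions, not the library's implementation: the function name is hypothetical, joining prefix and characteristic with an underscore is an assumption, and the sample flag text is invented.

```python
def sketch_split_col(rows, result_col="QA_flag", col_prefix="QA"):
    """Move result_col values into a column named per characteristic."""
    out = []
    for row in rows:
        new_row = dict(row)
        value = new_row.pop(result_col, None)
        # Assumption: prefix and characteristic are joined with an underscore.
        new_row[f"{col_prefix}_{row['CharacteristicName']}"] = value
        out.append(new_row)
    return out


# Hypothetical rows with an invented flag value.
rows = [
    {"CharacteristicName": "Phosphorus", "QA_flag": "units missing"},
    {"CharacteristicName": "Temperature", "QA_flag": None},
]
split_rows = sketch_split_col(rows)
print(split_rows[0])
```

Each characteristic ends up with its own QA column, so flags for different characteristics no longer share a single column.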
- harmonize_wq.wrangle.split_table(df_in)
Split DataFrame columns axis into main and characteristic based.
Splits pandas.DataFrame in two, one with main results columns and one with Characteristic based metadata.
Notes
Runs clean.datetime() and clean.harmonize_depth() if expected columns (‘Activity_datetime’ and ‘Depth’) are missing.
- Parameters:
df_in (pandas.DataFrame) – DataFrame that will be used to generate results.
- Returns:
main_df (pandas.DataFrame) – DataFrame with main results.
chars_df (pandas.DataFrame) – DataFrame with Characteristic based metadata.
Examples
See any of the ‘Simple’ notebooks found in demos for examples of how this function is used to divide the table into columns of interest (main_df) and characteristic specific metadata (chars_df).
- harmonize_wq.wrangle.to_simple_shape(gdf, out_shp)
Simplify GeoDataFrame for better export to shapefile.
Adopts and adapts ‘Simple’ from NWQMC/pywqp. See domains.stations_rename() for renaming of columns.
- Parameters:
gdf (geopandas.GeoDataFrame) – The GeoDataFrame to be exported to shapefile.
out_shp (str) – Shapefile directory and file name to be written.
Examples
Build example geopandas GeoDataFrame of locations for stations:
>>> import geopandas >>> from shapely.geometry import Point >>> from numpy import nan >>> d = {'MonitoringLocationIdentifier': ['In', 'Out'], ... 'geometry': [Point (-87.1250, 30.50000), ... Point (-87.5000, 30.50000),]} >>> gdf = geopandas.GeoDataFrame(d, crs="EPSG:4326") >>> gdf MonitoringLocationIdentifier geometry 0 In POINT (-87.12500 30.50000) 1 Out POINT (-87.50000 30.50000)
Add datetime column:
>>> gdf['ActivityStartDate'] = ['2004-09-01', '2004-02-18'] >>> gdf['ActivityStartTime/Time'] = ['10:01:00', '15:39:00'] >>> gdf['ActivityStartTime/TimeZoneCode'] = ['EST', 'EST'] >>> from harmonize_wq import clean >>> gdf = clean.datetime(gdf) >>> gdf MonitoringLocationIdentifier ... Activity_datetime 0 In ... 2004-09-01 15:01:00+00:00 1 Out ... 2004-02-18 20:39:00+00:00 [2 rows x 6 columns]
>>> from harmonize_wq import wrangle >>> wrangle.to_simple_shape(gdf, 'dataframe.shp')