Example Workflow
dataretrieval Query for a GeoJSON
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle
# File for area of interest
aoi_url = r'https://github.com/USEPA/harmonize-wq/raw/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
# Build query
query = {'characteristicName': ['Temperature, water',
'Depth, Secchi disk depth',
]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'
# Run query
res_narrow, md_narrow = wqp.get_results(**query)
# DataFrame of downloaded results
res_narrow
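To illustrate what a bounding box for an area of interest looks like, here is a minimal standard-library sketch that computes one from GeoJSON coordinates. It is not the implementation of `wrangle.get_bounding_box`. The function name `bounding_box` and the sample polygon are hypothetical, and the `west,south,east,north` ordering is an assumption about what the Water Quality Portal `bBox` parameter expects.

```python
import json


def bounding_box(geojson: dict) -> str:
    """Return 'west,south,east,north' covering every coordinate in a GeoJSON object."""
    xs, ys = [], []

    def walk(coords):
        # A position is a list whose first element is a number: [lon, lat, ...]
        if coords and isinstance(coords[0], (int, float)):
            xs.append(coords[0])
            ys.append(coords[1])
        else:
            # Otherwise it is a nested list of positions (lines, rings, polygons)
            for part in coords:
                walk(part)

    # Accept a FeatureCollection, a single Feature, or a bare geometry.
    # GeometryCollection and null geometries are not handled in this sketch.
    if geojson.get("type") == "FeatureCollection":
        features = geojson["features"]
    elif geojson.get("type") == "Feature":
        features = [geojson]
    else:
        features = [{"geometry": geojson}]

    for feature in features:
        walk(feature["geometry"]["coordinates"])

    return ",".join(str(v) for v in (min(xs), min(ys), max(xs), max(ys)))


# Hypothetical area of interest (made-up coordinates, not the PPBays file)
aoi = json.loads("""
{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "properties": {},
    "geometry": {
      "type": "Polygon",
      "coordinates": [[[-87.5, 30.2], [-87.0, 30.2], [-87.0, 30.6],
                       [-87.5, 30.6], [-87.5, 30.2]]]
    }
  }]
}
""")

bbox = bounding_box(aoi)
```

A string like `bbox` could then be assigned to `query['bBox']` in place of the helper's return value, provided the ordering assumption holds.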
Harmonize results
from harmonize_wq import harmonize
# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
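Part of harmonization is expressing every result for a characteristic in one unit (the output table in this README lists meters for Secchi depth and degrees Celsius for temperature). The sketch below shows that idea with plain Python. It is not how `harmonize.harmonize_all` is implemented, and the conversion tables, unit spellings, and function names are all hypothetical.

```python
# Hypothetical conversion factors to meters; unit spellings are illustrative only
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def to_meters(value: float, unit: str) -> float:
    """Convert a length to meters, raising on an unrecognized unit."""
    if unit not in TO_METERS:
        raise ValueError(f"Unrecognized length unit: {unit!r}")
    return value * TO_METERS[unit]


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to degrees Celsius, raising on an unrecognized unit."""
    if unit == "deg C":
        return value
    if unit == "deg F":
        # Temperature needs an offset as well as a scale factor
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"Unrecognized temperature unit: {unit!r}")


secchi_m = to_meters(10, "ft")          # Secchi depth reported in feet
temp_c = to_celsius(212.0, "deg F")     # temperature reported in Fahrenheit
```

Raising on unknown units, rather than passing values through silently, keeps mixed-unit results from ending up in the same column unnoticed.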
Clean results
from harmonize_wq import clean
# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized) # datetime
df_cleaned = clean.harmonize_depth(df_cleaned) # Sample depth
df_cleaned
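The datetime cleaning step combines the activity start date, time, and time zone code into a single UTC timestamp (see the `Activity_datetime` row of the output table). A minimal sketch of that combination using only the standard library is shown here. The offset table and the `to_utc` name are hypothetical, and `clean.datetime` may resolve time zones differently.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets (hours from UTC) for a few time zone codes
UTC_OFFSETS = {"UTC": 0, "EST": -5, "EDT": -4, "CST": -6, "CDT": -5}


def to_utc(date_str: str, time_str: str, tz_code: str) -> datetime:
    """Combine separate date, time, and zone-code fields into one UTC datetime."""
    # Parse the local wall-clock time (no zone attached yet)
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    # Attach the reported zone, then convert to UTC
    tz = timezone(timedelta(hours=UTC_OFFSETS[tz_code]))
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


stamp = to_utc("2019-07-15", "10:30:00", "CST")
```

Storing one zone-aware value makes samples from different time zones directly comparable when rows are later grouped by datetime.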
Transform results from long to wide format
Many columns in the pandas.DataFrame are characteristic-specific: they hold different values for the same sample depending on the characteristic.
To ensure one row for each sample after the transformation, these columns must either be split, generating a new column for each characteristic that has values, or moved out of the table if they are not being used.
from harmonize_wq import wrangle
# Split the QA column into multiple characteristic-specific QA columns
df_full = wrangle.split_col(df_cleaned)
# Divide the table into columns of interest (main_df) and characteristic-specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)
# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)
The number of columns in the resulting table is greatly reduced:
| Output Column | Type | Source | Changes |
| --- | --- | --- | --- |
| MonitoringLocationIdentifier | Defines row | MonitoringLocationIdentifier | NA |
| Activity_datetime | Defines row | ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode | Combined and UTC |
| ActivityIdentifier | Defines row | ActivityIdentifier | NA |
| OrganizationIdentifier | Defines row | OrganizationIdentifier | NA |
| OrganizationFormalName | Metadata | OrganizationFormalName | NA |
| ProviderName | Metadata | ProviderName | NA |
| StartDate | Metadata | ActivityStartDate | Preserves date where time NAT |
| Depth | Metadata | ResultDepthHeightMeasure/MeasureValue, ResultDepthHeightMeasure/MeasureUnitCode | Standardized to meters |
| Secchi | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | Standardized to meters |
| QA_Secchi | QA | NA | Harmonization quality issues |
| Temperature | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | Standardized to degrees Celsius |
| QA_Temperature | QA | NA | Harmonization quality issues |
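
The long-to-wide step can be pictured as merging rows that share the same organization, activity, location, and datetime into a single row with one column per characteristic. The sketch below shows that idea on plain dictionaries. It is not the implementation of `wrangle.collapse_results`; the `collapse` function and the sample rows are hypothetical and the values are made up.

```python
# Columns that together identify one sample (the "Defines row" columns)
KEYS = (
    "OrganizationIdentifier",
    "ActivityIdentifier",
    "MonitoringLocationIdentifier",
    "Activity_datetime",
)


def collapse(rows):
    """Merge long-format rows that share the same key columns into wide rows."""
    wide = {}
    for row in rows:
        key = tuple(row[k] for k in KEYS)
        # Start a new wide row holding the key columns the first time a key is seen
        merged = wide.setdefault(key, dict(zip(KEYS, key)))
        for col, val in row.items():
            if col in KEYS:
                continue
            # Record non-missing values (a later one overwrites an earlier one);
            # a missing value only fills a column that is not present yet
            if val is not None or col not in merged:
                merged[col] = val
    return list(wide.values())


# Made-up long-format rows: one result per row, other result columns missing
long_rows = [
    {"OrganizationIdentifier": "ORG1", "ActivityIdentifier": "A-1",
     "MonitoringLocationIdentifier": "ORG1-S1",
     "Activity_datetime": "2019-07-15T16:30:00+00:00",
     "Secchi": 1.5, "Temperature": None},
    {"OrganizationIdentifier": "ORG1", "ActivityIdentifier": "A-1",
     "MonitoringLocationIdentifier": "ORG1-S1",
     "Activity_datetime": "2019-07-15T16:30:00+00:00",
     "Secchi": None, "Temperature": 28.1},
    {"OrganizationIdentifier": "ORG1", "ActivityIdentifier": "A-2",
     "MonitoringLocationIdentifier": "ORG1-S2",
     "Activity_datetime": "2019-07-15T18:00:00+00:00",
     "Secchi": None, "Temperature": 27.4},
]

wide_rows = collapse(long_rows)
```

In this example the first two rows collapse into one sample carrying both a Secchi depth and a temperature, while the third stays a separate sample with a missing Secchi value.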
For more complete tutorial information, see: demos