The StreamCat Dataset provides summaries of natural and anthropogenic landscape features for ~2.65 million streams, and their associated catchments, within the conterminous USA. This repo contains code used in StreamCat to process a suite of landscape rasters to watersheds for streams and their associated catchments (local reach contributing area) within the conterminous USA using the NHDPlus Version 2 as the geospatial framework.
Users will need the following programs installed in order to run the code in the StreamCat GitHub repository:
Programs: Python, ArcGIS Pro (used to run ZonalStatisticsAsTable and TabulateArea tools with arcpy)
There are two options for installing the required dependencies. As shown below, both create a conda environment called streamcat, although you can use any name you want.
You can use a Python package manager such as miniforge, or the conda package manager that comes with ArcGIS Pro. The latter is available either through the package manager inside ArcGIS Pro or through the Python command prompt installed with ArcGIS Pro (Start > ArcGIS > Python command prompt). Note that the arcpy version in the environment should match the arcpy version of your ArcGIS Pro installation. At the conda command prompt the steps are:
The specific Python packages needed by the StreamCat code are listed in the streamcat.yml file in the StreamCat GitHub repository. Users can use this .yml file to create an environment with the necessary Python libraries by running the following lines at a conda prompt:
Create a local directory for your working files.
Make local copies of the NHDPlusV2 hydrology data and the StreamCat repository and store these in directories on your local machine.
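Before running any of the scripts, it can help to confirm from Python that the active environment actually contains the libraries the code relies on. The sketch below is only an illustration: the module list is an assumption made up of libraries named in this README, and streamcat.yml remains the authoritative list of dependencies.

```python
import importlib.util

# Assumed module list: only libraries named in this README.
# The authoritative dependency list is streamcat.yml.
MODULES = ("arcpy", "geopandas", "pyogrio", "joblib")


def missing_modules(names=MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If arcpy is reported missing, the environment was most likely not created from (or cloned from) the ArcGIS Pro Python installation.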
The StreamCat GitHub repository includes a control table, a configuration file, and Python scripts needed for running metrics.
In turn, these scripts rely on generic functions in StreamCat_functions.py and on the paths defined in stream_cat_config.py, which will need to be edited to match your directories and saved as a .py file.
To generate the riparian buffers used in StreamCat, we used the code in RiparianBuffers.py.
To generate percent full for catchments on the US border for point features, we used the code in border.py.
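As an illustration of the path editing that stream_cat_config.py requires, the following is a minimal sketch of how one setting could be derived from a single project directory. ACCUM_DIR is the only name taken from this README; PROJECT_DIR, its example value, and the helper accum_dir_for are hypothetical and are not part of the repository.

```python
from pathlib import Path


def accum_dir_for(project_dir):
    """Build the accumulation directory path, with a trailing slash,
    from a project directory (hypothetical helper, for illustration)."""
    return Path(project_dir).as_posix().rstrip("/") + "/accum_npy/"


# Hypothetical project root; replace with your own working directory.
PROJECT_DIR = "C:/StreamCat_work"

# ACCUM_DIR is the setting named in this README.
ACCUM_DIR = accum_dir_for(PROJECT_DIR)
```

Deriving every path from one root keeps the file easy to adapt when the working directory moves, though the real configuration file may define its paths independently.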
After editing the control tables to provide necessary information, such as directory paths, the following steps will execute processes to generate new watershed metrics for the conterminous US. This example uses Conda format within Spyder IDE.
Set ACCUM_DIR to “(project file directory)/accum_npy/”
Once StreamCat.py has run
This year we have been making efforts to improve StreamCat’s code base so that it runs more efficiently using modern computing techniques. We aim to tackle a few of the main areas where slowdowns occur in the StreamCat pipeline, namely initializing the vectors for each hydroregion and the accumulation process done for each StreamCat metric.
The biggest hurdle was the make_all_cat_comid function. We shifted from geopandas to pyogrio, reading with two parameters, read_geometry=False and use_arrow=True, and we parallelized the make_all_cat_comid function itself. Together these changes reduced the overall runtime from 888.993 seconds (~15 minutes) to 17 seconds when run using 16 parallel processes with the Joblib package.
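The combination described above can be sketched roughly as follows. This is not the actual make_all_cat_comid function: the function names, the FEATUREID column name, and the idea of returning sorted unique IDs are assumptions for illustration. Only the pyogrio parameters (read_geometry=False, use_arrow=True) and the use of Joblib with 16 processes come from the text. Note that use_arrow=True requires pyarrow to be installed.

```python
try:
    from joblib import Parallel, delayed
except ImportError:  # fall back to a serial loop if joblib is not installed
    Parallel = None


def read_comids(path, column="FEATUREID"):
    """Read one attribute column from a vector file, skipping geometry.
    The column name is an assumption; adjust it to your data."""
    from pyogrio import read_dataframe  # lazy import; needs pyogrio and pyarrow

    table = read_dataframe(
        path, columns=[column], read_geometry=False, use_arrow=True
    )
    return table[column].tolist()


def collect_comids(paths, reader=read_comids, n_jobs=16):
    """Read every file (in parallel when joblib is available) and
    return the sorted unique IDs found across all of them."""
    if Parallel is None:
        per_file = [reader(p) for p in paths]
    else:
        per_file = Parallel(n_jobs=n_jobs)(delayed(reader)(p) for p in paths)
    return sorted({comid for ids in per_file for comid in ids})
```

Skipping geometry avoids parsing polygon data that is never used when only the IDs are needed, which is likely where much of the saving comes from; running one task per file then spreads the remaining I/O across processes.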
Next we tackled the process_zone function. Originally it took 10 minutes to write 3 files and 20 minutes to write 15 files, which suggests a sizable startup cost. Using the same pyogrio and joblib improvements described above, the total time was reduced to 7 minutes.
Finally, in the accumulation process we could not parallelize the entire loop, because we have to keep the downstream iteration through the ordered dictionary used in the original code and StreamCat algorithm. However, we were able to parallelize the for loop inside the Accumulation function to achieve a 3x-7x speedup for both the up and ws accumulation types (see notes below for specifics). Overall, each metric is accumulated around 4.6x faster than before.
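To illustrate the idea of parallelizing only the per-catchment work, here is a minimal sketch. It is not the StreamCat Accumulation function. It assumes that the full set of upstream COMIDs for each catchment has already been computed (the `upstream` mapping), that "Up" sums upstream catchments only, and that "Ws" also includes the local catchment. All names and data structures here are hypothetical.

```python
try:
    from joblib import Parallel, delayed
except ImportError:  # fall back to a serial loop if joblib is not installed
    Parallel = None


def accumulate_one(comid, upstream, values, include_self):
    """Sum catchment values over one COMID's upstream catchments."""
    ids = list(upstream[comid])
    if include_self:  # assumed "Ws" behaviour: add the local catchment
        ids.append(comid)
    return comid, sum(values.get(i, 0.0) for i in ids)


def accumulate(upstream, values, tbl_type="Ws", n_jobs=4):
    """Accumulate values for every COMID, preserving the input order."""
    include_self = tbl_type == "Ws"
    if Parallel is None:
        results = [
            accumulate_one(c, upstream, values, include_self) for c in upstream
        ]
    else:
        results = Parallel(n_jobs=n_jobs)(
            delayed(accumulate_one)(c, upstream, values, include_self)
            for c in upstream
        )
    return dict(results)
```

Each task here is independent because it reads only precomputed inputs, which is what makes the inner loop safe to parallelize. In practice, batching many COMIDs per task would likely be needed to keep the overhead of passing data to worker processes small.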
The United States Environmental Protection Agency (EPA) GitHub project code is provided on an “as is” basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.