The EJAM package and Shiny app make use of many data objects, including numerous datasets stored in the package's /data/ folder, as well as several large tables stored in a separate repository created specifically to hold them. Those large tables contain information on Census blockgroups, Census block internal points, Census block population weights, and EPA FRS facilities.
How to Update Datasets in EJAM
The process begins from within the EJAM code repo, using the various datacreate_* scripts to create updated arrow datasets. Notes and scripts are consolidated in the /data-raw/ folder, and the starting point is the overarching set of scripts and comments in the file /data-raw/datacreate_0_UPDATE_ALL_DATASETS.R.
That file covers not only the large arrow datasets stored in a separate repository, but also many smaller data objects that are installed along with the package in the /data/ folder. Updating all the package's data objects can be complicated because there are many different data objects of different types, formats, and locations.
The various data objects need to be updated at different frequencies: some only yearly (ACS data), and others whenever facility IDs and locations change (as often as possible, e.g., when EPA's FRS is updated). Some need to be updated only when the package features/code change, such as the important data object called map_headernames and objects such as names_e.
These data files, code details, and other objects change ANNUALLY:
- Scripts that update/create the datasets, in EJAM/data-raw/datacreate_xyz.R files, including how those scripts assign version/vintage info via metadata
- Version Numbering
- Metadata about vintage/version in all datasets, like in EJAM/data/ - see issue #405
- Blockgroup datasets - see issue "Update/Obtain new EJScreen dataset annually and clean/modify for EJAM" #332. (The block, not block group, tables might be updated less often and are listed below.)
- Other data objects (summary info constants, metadata on indicators, etc.) - see issue "Update map_headernames, Names of Indicators (variables), etc." #333
- Test data (inputs) and examples of outputs - see issue "update Test data (inputs) and examples of outputs (every time parameters change & when outputs returned change)" #334
- Code and documentation in source files - see issue "update (annually as EJAM does) the functions and documentation in source files" #335
- Other documentation - see issue "update (annually as EJAM does) README, vignettes, other documentation" #336
- Block datasets, etc. - see issue "Update block datasets etc. when FIPS or boundaries change" #337. These change possibly each year if EJScreen block weights and Census FIPS and/or boundaries change, but certainly at least every 10 years.
- For frequently-updated (e.g., weekly) datasets, see issue "UPDATES - automate frequent updates of the frs and sic and naics datasets + others" #284
Again, all of those updates should be done starting from an understanding of the file /data-raw/datacreate_0_UPDATE_ALL_DATASETS.R. That script includes steps to update metadata and documentation and to save new versions of data in the data folder if appropriate.
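As one hedged illustration of what assigning vintage info via metadata and saving a dataset could look like (the object and attribute names below are assumptions for the sake of example, not EJAM's actual convention, which is defined in the datacreate_* scripts):

```r
# Hypothetical sketch only: attach vintage metadata to a dataset and save it to
# the package's data/ folder. The attribute names are illustrative assumptions;
# EJAM's datacreate_* scripts define the real convention.
mydataset <- data.frame(x = 1:3)
attr(mydataset, "date_saved_in_package") <- as.character(Sys.Date())
attr(mydataset, "acs_version") <- "2018-2022"   # example vintage label
usethis::use_data(mydataset, overwrite = TRUE)  # writes data/mydataset.rda
```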
The information below focuses on the other type of data object: the set of large arrow files that are stored outside the package code repository.
Repository that stores the large arrow files
The large, mostly Census-related tables are not installed as part of the R package in the typical /data/ folder that contains .rda files lazy-loaded by the package. Instead, they are kept in a separate GitHub repository that we refer to here as the data repository. The data repository used by the current (installed or loaded source) version of the package is given by:
desc::desc(file = system.file("DESCRIPTION", package = "EJAM"))$get("ejam_data_repo")
IMPORTANT: The name of the data repository (as distinct from the package code repository) must be recorded/updated in the EJAM package DESCRIPTION file, so that the package will know where to look for the data files if, for example, the datasets were moved to a new repository.
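For example, the DESCRIPTION file could contain a field like:
ejam_data_repo: USEPA/ejamdata
(The value shown here is only illustrative; check the actual DESCRIPTION file for the current owner/repo name.)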
Census-related arrow files
To store the large files needed by the EJAM package, we use the Apache Arrow file format through the {arrow} R package, with file extension .arrow. This allows us to work with larger-than-memory data and store it outside of the EJAM package itself.
The names of these tables should be listed in the file R/arrow_ds_names.R and in the global variable called .arrow_ds_names, which is used by functions like dataload_dynamic() and dataload_from_local().
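As a rough illustration (the authoritative definition lives in R/arrow_ds_names.R, not here), that variable might be defined along these lines, using the eleven file names listed below:

```r
# Sketch only: the arrow table names EJAM expects, as a character vector.
# See R/arrow_ds_names.R in the package source for the authoritative definition.
.arrow_ds_names <- c(
  "bgid2fips", "blockid2fips", "blockpoints", "blockwts", "bgej", "quaddata",
  "frs", "frs_by_naics", "frs_by_sic", "frs_by_programid", "frs_by_mact"
)
```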
As of mid-2025, there were 11 arrow files used by EJAM:
- bgid2fips.arrow: crosswalk of EJAM blockgroup IDs (1-n) with 12-digit blockgroup FIPS codes
- blockid2fips.arrow: crosswalk of EJAM block IDs (1-n) with 15-digit block FIPS codes
- blockpoints.arrow: Census block internal points lat-lon coordinates, EJAM block ID
- blockwts.arrow: Census block population weight as share of blockgroup population, EJAM block and blockgroup ID
- bgej.arrow: blockgroup-level statistics of EJ variables
- quaddata.arrow: 3D spherical coordinates of Census block internal points, with EJAM block ID
FRS-related arrow files
- frs.arrow: data.table of EPA Facility Registry Service (FRS) regulated sites
- frs_by_naics.arrow: data.table of NAICS industry code(s) for each EPA-regulated site in the Facility Registry Service
- frs_by_sic.arrow: data.table of SIC industry code(s) for each EPA-regulated site in the Facility Registry Service
- frs_by_programid.arrow: data.table of Program System ID code(s) for each EPA-regulated site in the Facility Registry Service
- frs_by_mact.arrow: data.table of MACT NESHAP subpart(s) that each EPA-regulated site is subject to
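If you need to inspect one of these files directly, outside of EJAM's own loaders, a hedged sketch using the {arrow} package might look like the following. The local path is an assumption; EJAM normally locates and loads these files via dataload_dynamic().

```r
library(arrow)
library(dplyr)

# The local path is an assumption; EJAM's own loaders manage the real location.
blockwts_ds <- open_dataset("data/blockwts.arrow", format = "arrow")

blockwts_ds$schema                    # inspect column names and types
blockwts_ds |> head(5) |> collect()   # materialize only the first few rows
```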
This document outlines how we operationalize EJAM's download, management, and in-app loading of these arrow datasets. Below is a description of a workable default approach, followed by options for potential improvements via automation and processing efficiency.
Development/Setup
The arrow files are stored in a separate, public, Git-LFS-enabled GitHub repo (henceforth "ejamdata"). The owner/repo name must be recorded/updated in the DESCRIPTION file field called ejam_data_repo; that info is used by the package.
Then, and any time the arrow datasets are updated, we update the ejamdata release version via the .github/push_to_ejam.yaml workflow in the ejamdata repo, thereby saving the arrow files with the release, to be downloaded automatically by EJAM.
EJAM's download_latest_arrow_data() function does the following (a minimal sketch follows this list):
- Checks the ejamdata repo's latest release/version.
- Checks the user's EJAM package's ejamdata version, which is stored in data/ejamdata_version.txt.
  - If the data/ejamdata_version.txt file doesn't exist, e.g. if it's the first time installing EJAM, it will be created at the end of the script.
- If the versions are different, downloads the latest arrow files from the latest ejamdata release with piggyback::pb_download(). (See how that function works for details.)
- We add a call to this function in the onAttach script (via the dataload_dynamic() function) so it runs and ensures the latest arrow files are downloaded when the user loads EJAM.
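A minimal sketch of that check-and-download logic, under stated assumptions: this is not EJAM's actual download_latest_arrow_data() source, the function name and paths here are illustrative, and only the {piggyback} calls named above are used.

```r
# Minimal sketch of the version check and download described above; paths and
# the function name are illustrative, not EJAM's actual internals.
update_arrow_files <- function(repo = "USEPA/ejamdata", dest = "data") {
  version_file <- file.path(dest, "ejamdata_version.txt")

  latest_release <- piggyback::pb_releases(repo = repo)$tag_name[1]  # newest release tag
  local_version  <- if (file.exists(version_file)) readLines(version_file)[1] else NA_character_

  if (is.na(local_version) || !identical(local_version, latest_release)) {
    # Download the .arrow assets attached to the latest release.
    piggyback::pb_download(repo = repo, tag = latest_release, dest = dest)
    writeLines(latest_release, version_file)  # record which version is now local
  }
  invisible(latest_release)
}
```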
How it Works for the User
- User installs EJAM
  - devtools::install_github("USEPA/EJAM-open") (or as adjusted depending on the actual repository owner and name)
- User loads EJAM as usual
  - library(EJAM). This will trigger the new download_latest_arrow_data() function.
- User runs EJAM as usual
  - The dataload_dynamic() function will work as usual because the data are now stored in the data directory.
How new versions of arrow datasets are republished/released
The key arrow files are updated from within the EJAM code repository, as explained above. Those files are then copied into a clone of the ejamdata repo and pushed to the actual ejamdata repo on GitHub (at USEPA/ejamdata). This triggers ejamdata's push_to_ejam.yaml workflow, which increments the latest release tag to reflect the new version and creates a new release.
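A hedged sketch of that manual copy step, assuming the refreshed files sit in the EJAM code repo's data/ folder and the ejamdata clone sits next to it (both paths are placeholders):

```r
# Illustrative sketch of the copy step; the source and destination paths are
# placeholders, assuming the ejamdata clone sits next to the EJAM code repo.
arrow_files <- list.files("data", pattern = "\\.arrow$", full.names = TRUE)
file.copy(arrow_files, to = "../ejamdata", overwrite = TRUE)
# Committing and pushing from the ejamdata clone then triggers its
# push_to_ejam.yaml workflow, which tags and publishes a new release.
```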
Potential Improvements
Making Code more Arrow-Friendly
Problem: loading the data as tibbles/data frames takes a long time.
Solution: We may be able to modify our code to be more arrow-friendly. This essentially keeps the analysis code as a sort of query, and only actually loads the results into memory when requested (via collect()). This dramatically reduces memory use, which would speed up processing times and avoid potential crashes resulting from not enough memory. However, this would require a decent lift to update the code in all places.
Pros: processing efficiency, significantly reduced memory usage
Implementation: This has been mostly implemented by the dataload_dynamic() function, which has a return_data_table parameter. If FALSE, the arrow file is loaded as an arrow dataset rather than as a tibble/dataframe.
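As an illustration of the arrow-friendly pattern described above (the file path and column names are assumptions for the sake of example, not EJAM's verified schema):

```r
library(arrow)
library(dplyr)

# Open the file as an Arrow Dataset; nothing is read into memory yet.
blockwts_ds <- open_dataset("data/blockwts.arrow", format = "arrow")

# dplyr verbs build up a query against the on-disk data ...
result <- blockwts_ds |>
  filter(blockwt > 0) |>                                 # column names are assumptions
  group_by(bgid) |>
  summarise(n_blocks = n(), total_wt = sum(blockwt)) |>
  collect()                        # ... and only collect() materializes a tibble in memory
```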