Utility to load datasets from the AWS DMAP Data Commons into memory

Usage

dataload_from_aws(
  varnames = .arrow_ds_names[1:3],
  ext = c(".arrow", ".rda")[2],
  fun = c("arrow::read_ipc_file", "load")[2],
  envir = globalenv(),
  mybucket = "dmap-data-commons-oa",
  mybucketfolder = "EJAM",
  folder_local_source = "./data/",
  justchecking = FALSE,
  check_server_even_if_justchecking = TRUE,
  testing = FALSE
)

Arguments

varnames

character vector of the quoted names of the data objects, like "blockwts" or "quaddata"

ext

file extension, like ".arrow" or ".rda"

fun

like "arrow::read_ipc_file" or "load" to use when reading

envir

e.g., globalenv() or parent.frame()

mybucket

name of the AWS bucket; the default shown in Usage is "dmap-data-commons-oa"

mybucketfolder

where in AWS, like EJAM

folder_local_source

path of folder (not ending in forward slash) to look in for locally saved copies during development to avoid waiting for download from a server.

justchecking

set to TRUE to get object size (and confirm file is accessible/exists)

check_server_even_if_justchecking

set this to FALSE to skip checking the server for the files when justchecking = TRUE. The server is always checked if justchecking = FALSE.

testing

only for testing
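
The ext and fun arguments pair a file extension with a reader named as a string. The sketch below shows one way such a string could be turned into a function and used to place an object in a chosen environment. It is not EJAM's implementation: resolve_reader() and read_into_env() are hypothetical helpers, and base::readRDS stands in for a value-returning reader such as arrow::read_ipc_file so that the example runs without the arrow package.

```r
# Hypothetical helper: turn "pkg::fun" or "fun" into a function object.
resolve_reader <- function(fun) {
  parts <- strsplit(fun, "::", fixed = TRUE)[[1]]
  if (length(parts) == 2) {
    getExportedValue(parts[1], parts[2])
  } else {
    match.fun(fun)
  }
}

# Hypothetical helper: read one file and place the object in `envir`.
# load() restores objects under their saved names, while readers such as
# readRDS() (or arrow::read_ipc_file) return a value that must be assigned.
read_into_env <- function(path, varname, fun, envir) {
  reader <- resolve_reader(fun)
  if (identical(reader, base::load)) {
    load(path, envir = envir)
  } else {
    assign(varname, reader(path), envir = envir)
  }
  invisible(varname)
}

# Demo with throwaway files in a temporary location
e <- new.env()

demo_obj <- data.frame(x = 1:2)
rda_path <- tempfile(fileext = ".rda")
save(demo_obj, file = rda_path)
read_into_env(rda_path, "demo_obj", "load", e)

rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(y = 3:4), rds_path)
read_into_env(rds_path, "demo_rds", "base::readRDS", e)

ls(e)
```

The distinction matters because load() ignores the name you ask for and uses whatever names were saved in the file, so the .rda route relies on each file containing an object named after itself.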

Value

nothing - just loads data into the environment (unless justchecking = TRUE)

Details

See source code for details.

Tries dataload_from_local() first (at least during development) to avoid slow downloads.
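
A minimal sketch of the local-first idea, assuming each object is stored in an .rda file named after it. The helper load_local_first() and the fetch_remote argument are hypothetical and only illustrate the fallback order; they are not the code used by dataload_from_local() or dataload_from_aws(). The "remote" step is faked with a temporary file so the example runs offline.

```r
# Hypothetical helper: use a local copy if present, otherwise call
# fetch_remote(varname), which must return the path of a downloaded .rda file.
load_local_first <- function(varname, folder_local, fetch_remote,
                             envir = globalenv()) {
  local_path <- file.path(folder_local, paste0(varname, ".rda"))
  if (file.exists(local_path)) {
    load(local_path, envir = envir)
    return(invisible("local"))
  }
  downloaded_path <- fetch_remote(varname)
  load(downloaded_path, envir = envir)
  invisible("remote")
}

# Demo setup: one object is available locally, the other is not.
demo_dir <- tempfile("local_source_")
dir.create(demo_dir)
demo_table <- data.frame(id = 1:3, value = c("a", "b", "c"))
save(demo_table, file = file.path(demo_dir, "demo_table.rda"))

# Stand-in for a download: writes a small .rda file and returns its path.
fake_fetch <- function(varname) {
  path <- tempfile(fileext = ".rda")
  tmp_env <- new.env()
  assign(varname, data.frame(id = 1L), envir = tmp_env)
  save(list = varname, file = path, envir = tmp_env)
  path
}

e <- new.env()
src_local  <- load_local_first("demo_table",  demo_dir, fake_fetch, envir = e)
src_remote <- load_local_first("other_table", demo_dir, fake_fetch, envir = e)
c(src_local, src_remote)
```

Passing the download step in as a function keeps the fallback logic testable without network access.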

Also see https://shiny.posit.co/r/articles/improve/scoping/

These files are public-facing – no credentials required.

Use EJAM:::dataload_from_aws(justchecking=TRUE)

or EJAM:::datapack("EJAM") to get info

or tables()

or object.size(quaddata)

blockid2fips was used only in state_from_blockid(), which testpoints_n() no longer uses, so it is not loaded unless and until it is needed. This avoids loading the huge file "blockid2fips" (100MB) and instead uses "bgid2fips" (3MB) as needed, which is only about 3% as large in memory. blockid2fips took roughly 600 MB in RAM because it stores 8 million block FIPS codes as text.
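
To see why storing FIPS codes as text is costly, the following sketch compares the in-memory size of a character vector of 15-digit codes with an integer vector of the same length. The codes are synthetic placeholders, not real block FIPS, and the vector is much shorter than the 8 million blocks mentioned above, so only the ratio is of interest.

```r
n <- 100000L

# Synthetic, unique 15-character codes standing in for block FIPS (assumption)
block_fips_text <- sprintf("%015.0f", 10000000000 + seq_len(n))

# Integer IDs of the same length, as an ID-based lookup might use
block_id_int <- seq_len(n) + 0L

size_text <- as.numeric(object.size(block_fips_text))
size_int  <- as.numeric(object.size(block_id_int))

# How many times larger the text representation is
size_text / size_int
```

Each unique string in R carries its own header in addition to its characters, which is why a long vector of distinct codes takes far more memory than the same number of integers.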

Files may include the following:

  • frs (150 MB .arrow file, approx 700 MB RAM)

  • frs_by_programid (approx 500 MB RAM)

  • frs_by_sic (approx 63 MB RAM)

  • frs_by_naics (approx 60 MB RAM)

  • frs_by_mact

  • quaddata (168 MB on disk, 229 MB RAM)

  • blockid2fips ( 20 MB on disk, 621 MB RAM!) No longer needed.

  • blockpoints ( 86 MB on disk, 164 MB RAM)

  • blockwts ( 31 MB on disk, 196 MB RAM)

  • bgej (123 MB RAM)

  • bgid2fips ( 18 MB RAM)