8 A SPGLM Application to the “Dancer” Damselfly (Argia)

Throughout this section, we will use the following packages:

library(finsyncR)
library(tidyverse)
library(sf)
library(tigris)
library(StreamCatTools)
library(nhdplusTools)
library(spmodel)
library(data.table)
library(pROC)
library(knitr)
library(prism)

Image from: inaturalist.org

To start this exercise, we’ll introduce a new R package called finsyncR. This package was developed by US EPA scientists and academic collaborators.

From the finsyncR GitHub page:

(fish and invertebrate synchronizer in R) is a data management package that integrates and processes national-level aquatic biomonitoring datasets in the U.S., with a focus on fish and macroinvertebrates sampled in rivers and streams.

The sources of data for this package are the United States Environmental Protection Agency’s (USEPA) National Aquatic Resource Surveys (NARS), namely the National River and Streams Assessment (NRSA), and United States Geological Survey’s (USGS) BioData.

In short, finsyncR provides us with access to thousands of biological samples across the U.S. from both EPA and USGS sources (e.g., Rumschlag et al. 2023). We’ll be using EPA data today. The EPA data are from the National Aquatic Resources Assessment’s Wadeable Streams Assessment (2001-2004) and National Rivers and Streams Assessments (NRSA; 2008/2009, 2013/2014, and 2018/2019).

We encourage potential users of finsyncR to thoroughly read the GitHub tutorial and other associated materials. A paper describing the package is currently under review in Methods in Ecology and Evolution.

8.1 Data Prep

8.1.1 Biological (Dependent) Data

Next, we will use finsyncR to get genus-level macroinvert data from just EPA and rarefy to 300 count. The code will also convert the data to occurrence data (1 = detect, 0 = non-detect) and set a seed to make it reproducible. Finally, we will include samples from boatable streams rather than just those that are wadeable.

macros <- getInvertData(dataType = "occur",
                        taxonLevel = "Genus",
                        agency = "EPA",
                        lifestage = FALSE,
                        rarefy = TRUE,
                        rarefyCount = 300,
                        sharedTaxa = FALSE,
                        seed = 1,
                        boatableStreams = T)
#> 
 finsyncR is running: Gathering, joining, and cleaning EPA raw data                    
 finsyncR is running: Rarefying EPA data                              
 finsyncR is running: Applying taxonomic fixes to EPA data                    
 finsyncR is running: Finalizing data for output                          
 finsyncR data synchronization complete

print(dim(macros))
#> [1] 6174  856

# Print an example of the data
kable(macros[1:5, 1:23])

Agency	SampleID	ProjectLabel	SiteNumber	CollectionDate	CollectionYear	CollectionMonth	CollectionDayOfYear	Latitude_dd	Longitude_dd	CoordinateDatum	COMID	StreamOrder	WettedWidth	PredictedWettedWidth_m	NARS_Ecoregion	SampleTypeCode	AreaSampTot_m2	FieldSplitRatio	LabSubsamplingRatio	PropID	Gen_ID_Prop	Ablabesmyia
EPA	10000	NRSA0809	NRS_KS-10018	2008-08-05	2008	8	218	39.01491	-98.01046	NAD83	18865370	2	2.2525	2.68	SPL	BERWW	1.022	NA	NA	1.0000000	0.9373650	1
EPA	10001	NRSA0809	NRS_KS-10008	2008-08-11	2008	8	224	37.39528	-98.92628	NAD83	21012349	3	4.9850	5.37	SPL	BERWW	0.372	NA	NA	0.1770852	0.8576998	0
EPA	10004	NRSA0809	NRS_KS-10040	2008-07-15	2008	7	197	37.78662	-96.42997	NAD83	21515712	4	10.0800	9.05	TPL	BERWW	1.022	NA	NA	0.4374453	0.9310987	1
EPA	10006	NRSA0809	NRS_KS-10064	2008-08-25	2008	8	238	37.99323	-99.32134	NAD83	22082845	7	4.5150	9.18	SPL	BERWW	0.186	NA	NA	0.2187705	0.9659091	0
EPA	1002013	NRSA1314	NRS_KS-10126	2013-04-30	2013	4	120	39.12843	-95.65402	NAD83	3645992	2	5.5500	5.18	TPL	BERW	1.022	NA	NA	0.2916983	0.9026718	0

Let’s massage the table with dplyr to get what we want, including data on the occurences of Argia:

Select columns of interest, including the Argia occurrence column.
Remove data related to the EPA’s Wadeable Streams Assessment (2001-2004).
Convert “CollectionDate” to a date (lubridate::date()) and convert presence/absence to a factor.
Finally, convert table to a sfobject and transform to EPSG:5070.

# Flexible code so we could model another taxon
genus <- 'Argia'

taxon <- macros %>%
  dplyr::select(SampleID, 
                ProjectLabel, 
                CollectionDate,  
                Latitude_dd,
                Longitude_dd,
                all_of(genus))  %>%
  mutate(CollectionDate = date(CollectionDate),
         presence = 
           as.factor(pull(., genus)))  %>% 
  st_as_sf(coords = c('Longitude_dd', 'Latitude_dd'), crs = 4269)  %>% 
  st_transform(crs = 5070)

To visualize the data, we will read in a layer of lower 48 states to give some context. To do this we can use the tigris package. We will also remove non-conterminous states and transform the projection to our favorite - EPSG:5070.

states <- tigris::states(cb = TRUE, progress_bar = FALSE)  %>% 
  filter(!STUSPS %in% c('HI', 'PR', 'AK', 'MP', 'GU', 'AS', 'VI'))  %>% 
  st_transform(crs = 5070)

First, let’s plot the observed presences/absences in the data…

ggplot() + 
  geom_sf(data = states, fill = NA) +
  geom_sf(data = taxon, 
          aes(color = presence),
          size = 1.5,
          alpha = 0.65) + 
  scale_color_manual(values=c("#d9d9d9", "#08519c")) +
  theme_bw() +
  theme(legend.position="bottom")

…and by NRSA survey period:

ggplot() + 
  geom_sf(data = states, fill = NA) +
  geom_sf(data = taxon, 
          aes(color = ProjectLabel),
          size = 1.5,
          alpha = 0.75) + 
  scale_color_manual(values=c("#a6cee3", "#1f78b4", "#b2df8a", "#33a02c")) +
  theme_bw() +
  theme(legend.position="bottom")

For today’s exercise, we’ll narrow down the samples to the northeaster region of the U.S.

Select states from the northeaster U.S.
Select NRSA sample sites that intersect with these states.
Filter to just the 2013/2014 and 2018/2019 sample periods.
Create a new column for sample year.
Select desired columns.


# Filter to study region (states)
region <- states %>% 
  filter(STUSPS %in% c('VT', 'NH', 'ME', 'NY', 'RI',
                       'MA', 'CT', 'NJ', 'PA', 'DE'))

# Use region as spatial filter (sf::st_filter()) for taxon of interest
taxon_rg <- taxon %>% 
  st_filter(region) %>% 
  filter(ProjectLabel %in% c('NRSA1314', 'NRSA1819')) %>% 
  mutate(year = year(ymd(CollectionDate))) %>% 
  select(SampleID:CollectionDate, presence:year) 

ggplot() + 
  geom_sf(data = region, fill = NA) +
  geom_sf(data = taxon_rg, 
          aes(color = presence)) + 
  scale_color_manual(values=c("#d9d9d9", "#08519c")) +
  theme_bw() +
  theme(legend.position="bottom")


ggplot() + 
  geom_sf(data = region, fill = NA) +
  geom_sf(data = taxon_rg, 
          aes(color = ProjectLabel)) + 
  scale_color_manual(values=c("#1f78b4", "#b2df8a")) +
  theme_bw() +
  theme(legend.position="bottom")


taxon_rg %>% 
  pull(presence) %>% 
  table()
#> .
#>   0   1 
#> 485  65

8.1.2 Predictor (Independent) Data

Obtain list of NHDPlus COMIDs that match sample sites from nhdplusTools

Use NLDI service via StreamCat to get the COMIDs
Create a vector of COMIDs by splitting the COMID string
Add COMID to our Argia occurrence table

comids <- sc_get_comid(taxon_rg)

#comids <- read_rds('./data/nrsa_comids.rds')
comid_vect <- 
  comids %>%
  str_split(',') %>%
  unlist() %>%
  as.integer()

taxon_rg <- 
  taxon_rg %>%
  mutate(COMID = comid_vect)

Get non-varying StreamCat data.

sc <- 
  sc_get_data(comid = comids,
              aoi = 'watershed',
              metric = 'bfi, precip8110, wetindex, elev',
              showAreaSqKm = TRUE)

Get year-specific wetlands within watersheds.

NLCD contains data on the distribution of herbaceous (pcthbwet) and woody (pctwdwet) wetlands. We will combine them into a single metric representing the % of the watershed comprised of wetlands.
Duplicate the data, but offset by 1 year so we can 2019 NLCD to 2018 observations (same for 2013 NLCD and 2014 NRSA).

wetlands <- 
  sc_get_data(comid = comids,
              aoi = 'watershed',
              metric = 'pctwdwet2013,pcthbwet2013,pctwdwet2019,pcthbwet2019',
              showAreaSqKm = FALSE) %>% 
  
  # Sum wetland types to create single wetlands metric
  mutate(PCTWETLAND2013WS = PCTHBWET2013WS + PCTWDWET2013WS,
         PCTWETLAND2019WS = PCTHBWET2019WS + PCTWDWET2019WS) %>% 
  
  # Reduce columns
  select(COMID, PCTWETLAND2013WS, PCTWETLAND2019WS) %>% 
  
  # Create long table w/ column name w/out year
  pivot_longer(!COMID, names_to = 'tmpcol', values_to = 'PCTWETLANDXXXXWS') %>% 
  
  # Create new column of year by removing "PCTWETLAND" and "WS" from names
  mutate(year = as.integer(str_replace_all(tmpcol, 'PCTWETLAND|WS', ''))) 

# But some samples have 2014 and 2018 as sample years? How can we trick the data into joining?
# We can match 2019 data to 2018 observations by subtracting a year and appending it to the data

# Create tmp table with 1 added or subtracted to year of record
tmp_wetlands <- wetlands %>% 
  mutate(year = ifelse(year == 2013, year + 1, year - 1))

# rbind() wetlands and tmp_wetlands so we have records to join to 2014 and 2018
wetlands <- wetlands %>% 
  rbind(tmp_wetlands) %>% 
  select(-tmpcol)

Year-specific impervious surfaces within 100-m riparian buffer.

riparian_imp <- 
  sc_get_data(comid = comids,
              aoi = 'riparian_watershed',
              metric = 'pctimp2013, pctimp2019',
              showAreaSqKm = FALSE) %>% 
  select(-WSAREASQKMRP100) %>% 
  pivot_longer(!COMID, names_to = 'tmpcol', values_to = 'PCTIMPXXXXWSRP100') %>% 
  mutate(year = as.integer(
    str_replace_all(tmpcol, 'PCTIMP|WSRP100', '')))

tmp_imp <- riparian_imp %>% 
  mutate(year = ifelse(year == 2013, year + 1, year - 1))

riparian_imp <- riparian_imp %>% 
  rbind(tmp_imp) %>% 
  select(-tmpcol)

PRISM air temperatures for sample periods

The prism package requires that we set a temporary folder in our work space. Here, we set it to “prism_data” inside of our “data” folder. It will create this folder if it does not already exist.
We then stack the climate rasters and use terra::extract() to

# Get these years of PRISM
years <- c(2013, 2014, 2018, 2019)

# Set the PRISM directory (creates directory in not present)
prism_set_dl_dir("./data/prism_data", create = TRUE)

# Download monthly PRISM rasters (tmean)
get_prism_monthlys('tmean', 
                   years = years, 
                   mon = 7:8, 
                   keepZip = FALSE)
#> 
  |                                                                         
  |                                                                   |   0%
  |                                                                         
  |========                                                           |  12%
  |                                                                         
  |=================                                                  |  25%
  |                                                                         
  |=========================                                          |  38%
  |                                                                         
  |==================================                                 |  50%
  |                                                                         
  |==========================================                         |  62%
  |                                                                         
  |==================================================                 |  75%
  |                                                                         
  |===========================================================        |  88%
  |                                                                         
  |===================================================================| 100%

# Create stack of downloaded PRISM rasters
tmn <- pd_stack((prism_archive_subset("tmean","monthly", 
                                      years = years, 
                                      mon = 7:8)))

# Extract tmean at sample points and massage data
tmn <- terra::extract(tmn, 
                      # Transform taxon_rg to CRS of PRISM on the fly
                      taxon_rg %>% 
                        st_transform(crs = st_crs(tmn))) %>%
  
  # Add COMIDs to extracted values
  data.frame(COMID = comid_vect, .) %>%
  
  # Remove front and back text from PRISM year/month in names
  rename_with( ~ stringr::str_replace_all(., 'PRISM_tmean_stable_4kmM3_|_bil', '')) %>% 
  
  # Pivot to long table and calle column TMEANPRISMXXXXPT, XXXX indicates year
  pivot_longer(!COMID, names_to = 'year_month', 
               values_to = 'TMEANPRISMXXXXPT') %>% 
  
  # Create new column of year
  mutate(year = year(ym(year_month))) %>% 
  
  # Average July and August temperatures 
  summarise(TMEANPRISMXXXXPT = mean(TMEANPRISMXXXXPT, na.rm = TRUE), 
            .by = c(COMID, year))

8.1.3 Combine Dependent and Independent Data

argia_model_data <-
  taxon_rg %>%
  left_join(sc, join_by(COMID)) %>%
  left_join(wetlands, join_by(COMID, year)) %>%
  left_join(riparian_imp, join_by(COMID, year)) %>%
  left_join(tmn, join_by(COMID, year)) %>%
  drop_na()

cor(argia_model_data %>%
      st_drop_geometry() %>%
      select(WSAREASQKM:TMEANPRISMXXXXPT))
#>                    WSAREASQKM      ELEVWS WETINDEXWS       BFIWS
#> WSAREASQKM         1.00000000  0.19649723 -0.0844127 -0.17679333
#> ELEVWS             0.19649723  1.00000000 -0.6020139 -0.56349466
#> WETINDEXWS        -0.08441270 -0.60201389  1.0000000  0.40544326
#> BFIWS             -0.17679333 -0.56349466  0.4054433  1.00000000
#> PRECIP8110WS      -0.15455456 -0.02973549 -0.1213487  0.17231568
#> PCTWETLANDXXXXWS  -0.13274790 -0.53982414  0.7838752  0.49003820
#> PCTIMPXXXXWSRP100 -0.10603160 -0.39811621  0.1008221  0.07181972
#> TMEANPRISMXXXXPT   0.08241251 -0.68038439  0.3688678  0.25976247
#>                   PRECIP8110WS PCTWETLANDXXXXWS PCTIMPXXXXWSRP100
#> WSAREASQKM         -0.15455456      -0.13274790       -0.10603160
#> ELEVWS             -0.02973549      -0.53982414       -0.39811621
#> WETINDEXWS         -0.12134865       0.78387523        0.10082210
#> BFIWS               0.17231568       0.49003820        0.07181972
#> PRECIP8110WS        1.00000000       0.03862313        0.17634436
#> PCTWETLANDXXXXWS    0.03862313       1.00000000       -0.04885782
#> PCTIMPXXXXWSRP100   0.17634436      -0.04885782        1.00000000
#> TMEANPRISMXXXXPT    0.04030682       0.24384288        0.43121267
#>                   TMEANPRISMXXXXPT
#> WSAREASQKM              0.08241251
#> ELEVWS                 -0.68038439
#> WETINDEXWS              0.36886778
#> BFIWS                   0.25976247
#> PRECIP8110WS            0.04030682
#> PCTWETLANDXXXXWS        0.24384288
#> PCTIMPXXXXWSRP100       0.43121267
#> TMEANPRISMXXXXPT        1.00000000

8.2 Modeling occurrence of genus Argia

8.2.1 Model formulation

formula <-
  presence ~
  I(log10(WSAREASQKM)) +
  ELEVWS +
  WETINDEXWS +
  BFIWS +
  PRECIP8110WS +
  PCTWETLANDXXXXWS +
  PCTIMPXXXXWSRP100 +
  TMEANPRISMXXXXPT

bin_mod <- spglm(formula = formula,
                 data = argia_model_data,
                 family = 'binomial',
                 spcov_type = 'none')

bin_spmod <- spglm(formula = formula,
                   data = argia_model_data,
                   family = 'binomial',
                   spcov_type = 'exponential')

glances(bin_mod, bin_spmod)
#> # A tibble: 2 × 10
#>   model      n     p  npar value   AIC  AICc logLik deviance pseudo.r.squared
#>   <chr>  <int> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
#> 1 bin_s…   550     9     3 1394. 1400. 1400.  -697.     201.            0.110
#> 2 bin_m…   550     9     1 1409. 1411. 1411.  -704.     342.            0.145

summary(bin_spmod)
#> 
#> Call:
#> spglm(formula = formula, family = "binomial", data = argia_model_data, 
#>     spcov_type = "exponential")
#> 
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.4692 -0.4226 -0.2762 -0.1450  2.4044 
#> 
#> Coefficients (fixed):
#>                       Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)           8.089431   6.729301   1.202   0.2293    
#> I(log10(WSAREASQKM))  0.808323   0.202435   3.993 6.52e-05 ***
#> ELEVWS               -0.006700   0.002643  -2.535   0.0112 *  
#> WETINDEXWS            0.003059   0.004851   0.631   0.5283    
#> BFIWS                -0.032951   0.042665  -0.772   0.4399    
#> PRECIP8110WS         -0.003344   0.002894  -1.156   0.2478    
#> PCTWETLANDXXXXWS     -0.016386   0.048407  -0.339   0.7350    
#> PCTIMPXXXXWSRP100     0.088456   0.047796   1.851   0.0642 .  
#> TMEANPRISMXXXXPT     -0.348964   0.149201  -2.339   0.0193 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Pseudo R-squared: 0.1101
#> 
#> Coefficients (exponential spatial covariance):
#>        de        ie     range 
#> 3.121e+00 9.212e-03 3.636e+04 
#> 
#> Coefficients (Dispersion for binomial family):
#> dispersion 
#>          1

8.2.2 Model performance

# Function to convert from log odds to probability
to_prob <- function(x) exp(x)/(1+exp(x))


loocv_mod <- loocv(bin_mod, cv_predict = TRUE, se.fit = TRUE) 
prob_mod <- to_prob(loocv_mod$cv_predict)
sefit_mod <- loocv_mod$se.fit

loocv_spmod <- loocv(bin_spmod, cv_predict = TRUE, se.fit = TRUE)
prob_spmod <- to_prob(loocv_spmod$cv_predict)
sefit_spmod <- loocv_spmod$se.fit

pROC::auc(argia_model_data$presence, prob_mod)
#> Area under the curve: 0.7908
pROC::auc(argia_model_data$presence, prob_spmod)
#> Area under the curve: 0.9154

argia_model_data <- argia_model_data %>%
  mutate(prob_spmod = prob_spmod,
         sefit_spmod = sefit_spmod)

ggplot() +
  geom_sf(data = region, fill = NA) +
  geom_sf(data = argia_model_data,
          aes(color = prob_spmod)) +
  scale_color_viridis_b() +
  theme_bw() +
  theme(legend.position="bottom")


ggplot() +
  geom_sf(data = region, fill = NA) +
  geom_sf(data = argia_model_data,
          aes(color = sefit_spmod)) +
  scale_color_distiller(palette = 'Reds', direction = 1) +
  theme_bw() +
  theme(legend.position="bottom")

8.2.3 EXTRA: Mapping predictions

state <- region %>% 
  filter(STUSPS == "NJ") %>% 
  st_transform(crs = 4326)

# Use get_nhdplus to access the individual stream sub-catchments
pourpoints <- 
  nhdplusTools::get_nhdplus(AOI = state,
                            realization = 'outlet') |> 
  filter(flowdir == "With Digitized")

sc_prd <- sc_get_data(state = 'NJ',
                      aoi = 'watershed,riparian_watershed',
                      metric = 'bfi,precip8110,wetindex,elev,pctwdwet2019,pcthbwet2019,pctimp2019') |> 
  mutate(PCTWETLANDXXXXWS = PCTWDWET2019WS + PCTHBWET2019WS) |> 
  rename(PCTIMPXXXXWSRP100 = PCTIMP2019WSRP100) |> 
  select(COMID, WSAREASQKM, ELEVWS, WETINDEXWS, BFIWS, 
         PRECIP8110WS, PCTWETLANDXXXXWS, PCTIMPXXXXWSRP100)

tmn_prd <- 
  pd_stack((prism_archive_subset("tmean","monthly", 
                                 years = 2019, 
                                 mon = 7:8)))
tmn_prd <-
  terra::extract(tmn_prd, 
                 pourpoints %>% 
                   st_transform(crs = st_crs(tmn_prd))) |> 
  as.tibble() |> 
  mutate(COMID = pourpoints$comid,
         TMEANPRISMXXXXPT = (PRISM_tmean_stable_4kmM3_201907_bil + PRISM_tmean_stable_4kmM3_201908_bil)/2) |> 
  select(COMID, TMEANPRISMXXXXPT)

argia_pred_data <- sc_prd |> 
  left_join(tmn_prd, join_by(COMID)) |>
  left_join(pourpoints, join_by(COMID == comid)) |> 
  st_as_sf() |>
  st_transform(crs = 5070) |>
  select(COMID, WSAREASQKM, ELEVWS, WETINDEXWS,
         BFIWS, PRECIP8110WS, PCTWETLANDXXXXWS,
         PCTIMPXXXXWSRP100, TMEANPRISMXXXXPT) |> 
  na.omit() 

argia_predict <- 
  augment(bin_spmod, 
          newdata = argia_pred_data,
          type = 'response',
          se_fit = TRUE,
          local = TRUE) 

ggplot() +
  geom_sf(data = argia_predict,
          aes(color = .fitted),
          size = 0.9) +
  scale_color_distiller(palette = 'YlOrRd', direction = 2) +
  theme_bw()


ggplot() +
  geom_sf(data = argia_predict,
          aes(color = .se.fit),
          size = 0.9) +
  scale_color_distiller(palette = 'YlOrRd', direction = 2) +
  theme_bw()

8.3 R Code Appendix

library(finsyncR)
library(tidyverse)
library(sf)
library(tigris)
library(StreamCatTools)
library(nhdplusTools)
library(spmodel)
library(data.table)
library(pROC)
library(knitr)
library(prism)
macros <- getInvertData(dataType = "occur",
                        taxonLevel = "Genus",
                        agency = "EPA",
                        lifestage = FALSE,
                        rarefy = TRUE,
                        rarefyCount = 300,
                        sharedTaxa = FALSE,
                        seed = 1,
                        boatableStreams = T)

print(dim(macros))

# Print an example of the data
kable(macros[1:5, 1:23])
# Flexible code so we could model another taxon
genus <- 'Argia'

taxon <- macros %>%
  dplyr::select(SampleID, 
                ProjectLabel, 
                CollectionDate,  
                Latitude_dd,
                Longitude_dd,
                all_of(genus))  %>%
  mutate(CollectionDate = date(CollectionDate),
         presence = 
           as.factor(pull(., genus)))  %>% 
  st_as_sf(coords = c('Longitude_dd', 'Latitude_dd'), crs = 4269)  %>% 
  st_transform(crs = 5070)
states <- tigris::states(cb = TRUE, progress_bar = FALSE)  %>% 
  filter(!STUSPS %in% c('HI', 'PR', 'AK', 'MP', 'GU', 'AS', 'VI'))  %>% 
  st_transform(crs = 5070)
ggplot() + 
  geom_sf(data = states, fill = NA) +
  geom_sf(data = taxon, 
          aes(color = presence),
          size = 1.5,
          alpha = 0.65) + 
  scale_color_manual(values=c("#d9d9d9", "#08519c")) +
  theme_bw() +
  theme(legend.position="bottom") 
ggplot() + 
  geom_sf(data = states, fill = NA) +
  geom_sf(data = taxon, 
          aes(color = ProjectLabel),
          size = 1.5,
          alpha = 0.75) + 
  scale_color_manual(values=c("#a6cee3", "#1f78b4", "#b2df8a", "#33a02c")) +
  theme_bw() +
  theme(legend.position="bottom") 

# Filter to study region (states)
region <- states %>% 
  filter(STUSPS %in% c('VT', 'NH', 'ME', 'NY', 'RI',
                       'MA', 'CT', 'NJ', 'PA', 'DE'))

# Use region as spatial filter (sf::st_filter()) for taxon of interest
taxon_rg <- taxon %>% 
  st_filter(region) %>% 
  filter(ProjectLabel %in% c('NRSA1314', 'NRSA1819')) %>% 
  mutate(year = year(ymd(CollectionDate))) %>% 
  select(SampleID:CollectionDate, presence:year) 

ggplot() + 
  geom_sf(data = region, fill = NA) +
  geom_sf(data = taxon_rg, 
          aes(color = presence)) + 
  scale_color_manual(values=c("#d9d9d9", "#08519c")) +
  theme_bw() +
  theme(legend.position="bottom") 

ggplot() + 
  geom_sf(data = region, fill = NA) +
  geom_sf(data = taxon_rg, 
          aes(color = ProjectLabel)) + 
  scale_color_manual(values=c("#1f78b4", "#b2df8a")) +
  theme_bw() +
  theme(legend.position="bottom") 

taxon_rg %>% 
  pull(presence) %>% 
  table()
comids <- sc_get_comid(taxon_rg)

#comids <- read_rds('./data/nrsa_comids.rds')
comid_vect <- 
  comids %>%
  str_split(',') %>%
  unlist() %>%
  as.integer()

taxon_rg <- 
  taxon_rg %>%
  mutate(COMID = comid_vect) 
sc <- 
  sc_get_data(comid = comids,
              aoi = 'watershed',
              metric = 'bfi, precip8110, wetindex, elev',
              showAreaSqKm = TRUE)
wetlands <- 
  sc_get_data(comid = comids,
              aoi = 'watershed',
              metric = 'pctwdwet2013,pcthbwet2013,pctwdwet2019,pcthbwet2019',
              showAreaSqKm = FALSE) %>% 
  
  # Sum wetland types to create single wetlands metric
  mutate(PCTWETLAND2013WS = PCTHBWET2013WS + PCTWDWET2013WS,
         PCTWETLAND2019WS = PCTHBWET2019WS + PCTWDWET2019WS) %>% 
  
  # Reduce columns
  select(COMID, PCTWETLAND2013WS, PCTWETLAND2019WS) %>% 
  
  # Create long table w/ column name w/out year
  pivot_longer(!COMID, names_to = 'tmpcol', values_to = 'PCTWETLANDXXXXWS') %>% 
  
  # Create new column of year by removing "PCTWETLAND" and "WS" from names
  mutate(year = as.integer(str_replace_all(tmpcol, 'PCTWETLAND|WS', ''))) 

# But some samples have 2014 and 2018 as sample years? How can we trick the data into joining?
# We can match 2019 data to 2018 observations by subtracting a year and appending it to the data

# Create tmp table with 1 added or subtracted to year of record
tmp_wetlands <- wetlands %>% 
  mutate(year = ifelse(year == 2013, year + 1, year - 1))

# rbind() wetlands and tmp_wetlands so we have records to join to 2014 and 2018
wetlands <- wetlands %>% 
  rbind(tmp_wetlands) %>% 
  select(-tmpcol)
riparian_imp <- 
  sc_get_data(comid = comids,
              aoi = 'riparian_watershed',
              metric = 'pctimp2013, pctimp2019',
              showAreaSqKm = FALSE) %>% 
  select(-WSAREASQKMRP100) %>% 
  pivot_longer(!COMID, names_to = 'tmpcol', values_to = 'PCTIMPXXXXWSRP100') %>% 
  mutate(year = as.integer(
    str_replace_all(tmpcol, 'PCTIMP|WSRP100', '')))

tmp_imp <- riparian_imp %>% 
  mutate(year = ifelse(year == 2013, year + 1, year - 1))

riparian_imp <- riparian_imp %>% 
  rbind(tmp_imp) %>% 
  select(-tmpcol)
# Get these years of PRISM
years <- c(2013, 2014, 2018, 2019)

# Set the PRISM directory (creates directory in not present)
prism_set_dl_dir("./data/prism_data", create = TRUE)

# Download monthly PRISM rasters (tmean)
get_prism_monthlys('tmean', 
                   years = years, 
                   mon = 7:8, 
                   keepZip = FALSE)

# Create stack of downloaded PRISM rasters
tmn <- pd_stack((prism_archive_subset("tmean","monthly", 
                                      years = years, 
                                      mon = 7:8)))

# Extract tmean at sample points and massage data
tmn <- terra::extract(tmn, 
                      # Transform taxon_rg to CRS of PRISM on the fly
                      taxon_rg %>% 
                        st_transform(crs = st_crs(tmn))) %>%
  
  # Add COMIDs to extracted values
  data.frame(COMID = comid_vect, .) %>%
  
  # Remove front and back text from PRISM year/month in names
  rename_with( ~ stringr::str_replace_all(., 'PRISM_tmean_stable_4kmM3_|_bil', '')) %>% 
  
  # Pivot to long table and calle column TMEANPRISMXXXXPT, XXXX indicates year
  pivot_longer(!COMID, names_to = 'year_month', 
               values_to = 'TMEANPRISMXXXXPT') %>% 
  
  # Create new column of year
  mutate(year = year(ym(year_month))) %>% 
  
  # Average July and August temperatures 
  summarise(TMEANPRISMXXXXPT = mean(TMEANPRISMXXXXPT, na.rm = TRUE), 
            .by = c(COMID, year))
argia_model_data <-
  taxon_rg %>%
  left_join(sc, join_by(COMID)) %>%
  left_join(wetlands, join_by(COMID, year)) %>%
  left_join(riparian_imp, join_by(COMID, year)) %>%
  left_join(tmn, join_by(COMID, year)) %>%
  drop_na()

cor(argia_model_data %>%
      st_drop_geometry() %>%
      select(WSAREASQKM:TMEANPRISMXXXXPT))

formula <-
  presence ~
  I(log10(WSAREASQKM)) +
  ELEVWS +
  WETINDEXWS +
  BFIWS +
  PRECIP8110WS +
  PCTWETLANDXXXXWS +
  PCTIMPXXXXWSRP100 +
  TMEANPRISMXXXXPT

bin_mod <- spglm(formula = formula,
                 data = argia_model_data,
                 family = 'binomial',
                 spcov_type = 'none')

bin_spmod <- spglm(formula = formula,
                   data = argia_model_data,
                   family = 'binomial',
                   spcov_type = 'exponential')

glances(bin_mod, bin_spmod)

summary(bin_spmod)
# Function to convert from log odds to probability
to_prob <- function(x) exp(x)/(1+exp(x))


loocv_mod <- loocv(bin_mod, cv_predict = TRUE, se.fit = TRUE) 
prob_mod <- to_prob(loocv_mod$cv_predict)
sefit_mod <- loocv_mod$se.fit

loocv_spmod <- loocv(bin_spmod, cv_predict = TRUE, se.fit = TRUE)
prob_spmod <- to_prob(loocv_spmod$cv_predict)
sefit_spmod <- loocv_spmod$se.fit

pROC::auc(argia_model_data$presence, prob_mod)
pROC::auc(argia_model_data$presence, prob_spmod)

argia_model_data <- argia_model_data %>%
  mutate(prob_spmod = prob_spmod,
         sefit_spmod = sefit_spmod)

ggplot() +
  geom_sf(data = region, fill = NA) +
  geom_sf(data = argia_model_data,
          aes(color = prob_spmod)) +
  scale_color_viridis_b() +
  theme_bw() +
  theme(legend.position="bottom")

ggplot() +
  geom_sf(data = region, fill = NA) +
  geom_sf(data = argia_model_data,
          aes(color = sefit_spmod)) +
  scale_color_distiller(palette = 'Reds', direction = 1) +
  theme_bw() +
  theme(legend.position="bottom")
state <- region %>% 
  filter(STUSPS == "NJ") %>% 
  st_transform(crs = 4326)

# Use get_nhdplus to access the individual stream sub-catchments
pourpoints <- 
  nhdplusTools::get_nhdplus(AOI = state,
                            realization = 'outlet') |> 
  filter(flowdir == "With Digitized")

sc_prd <- sc_get_data(state = 'NJ',
                      aoi = 'watershed,riparian_watershed',
                      metric = 'bfi,precip8110,wetindex,elev,pctwdwet2019,pcthbwet2019,pctimp2019') |> 
  mutate(PCTWETLANDXXXXWS = PCTWDWET2019WS + PCTHBWET2019WS) |> 
  rename(PCTIMPXXXXWSRP100 = PCTIMP2019WSRP100) |> 
  select(COMID, WSAREASQKM, ELEVWS, WETINDEXWS, BFIWS, 
         PRECIP8110WS, PCTWETLANDXXXXWS, PCTIMPXXXXWSRP100)

tmn_prd <- 
  pd_stack((prism_archive_subset("tmean","monthly", 
                                 years = 2019, 
                                 mon = 7:8)))
tmn_prd <-
  terra::extract(tmn_prd, 
                 pourpoints %>% 
                   st_transform(crs = st_crs(tmn_prd))) |> 
  as.tibble() |> 
  mutate(COMID = pourpoints$comid,
         TMEANPRISMXXXXPT = (PRISM_tmean_stable_4kmM3_201907_bil + PRISM_tmean_stable_4kmM3_201908_bil)/2) |> 
  select(COMID, TMEANPRISMXXXXPT)

argia_pred_data <- sc_prd |> 
  left_join(tmn_prd, join_by(COMID)) |>
  left_join(pourpoints, join_by(COMID == comid)) |> 
  st_as_sf() |>
  st_transform(crs = 5070) |>
  select(COMID, WSAREASQKM, ELEVWS, WETINDEXWS,
         BFIWS, PRECIP8110WS, PCTWETLANDXXXXWS,
         PCTIMPXXXXWSRP100, TMEANPRISMXXXXPT) |> 
  na.omit() 

argia_predict <- 
  augment(bin_spmod, 
          newdata = argia_pred_data,
          type = 'response',
          se_fit = TRUE,
          local = TRUE) 

ggplot() +
  geom_sf(data = argia_predict,
          aes(color = .fitted),
          size = 0.9) +
  scale_color_distiller(palette = 'YlOrRd', direction = 2) +
  theme_bw()

ggplot() +
  geom_sf(data = argia_predict,
          aes(color = .se.fit),
          size = 0.9) +
  scale_color_distiller(palette = 'YlOrRd', direction = 2) +
  theme_bw()