TADA Module 1: Training for Beginner R Users
TADA Team
2024-12-27
Source:vignettes/TADAModule1_BeginnerTraining.Rmd
TADAModule1_BeginnerTraining.Rmd
Welcome
Thank you for your interest in Tools for Automated Data Analysis (TADA). TADA is an open-source tool set built in the R programming language. This RMarkdown document walks users through how to download the TADA R package from GitHub, access and parameterize several important functions, and create basic visualizations with a sample data set. The sample data set contains data from one week from all EPA Region 5 states.
Note: TADA is still under development. New functionality is added weekly, and sometimes we need to make bug fixes in response to tester and user feedback. We appreciate your feedback, patience, and interest in these helpful tools.
If you are interested in contributing to TADA development, more information is available at [Contributing] (https://usepa.github.io/EPATADA/articles/CONTRIBUTING.html). We welcome collaboration with external partners.
Install and load packages
First, install and load the remotes package specifying the repo. This is needed before installing TADA because it is only available on GitHub.
install.packages("remotes",
repos = "http://cran.us.r-project.org"
)
library(remotes)
Next, install and load TADA using the remotes package. TADA R Package dependencies will also be downloaded automatically from CRAN with the TADA install. You may be prompted in the console to update dependency packages that have more recent versions available. If you see this prompt, it is recommended to update all of them (enter 1 into the console).
remotes::install_github("USEPA/EPATADA",
ref = "develop",
dependencies = TRUE
)
# remotes::install_github("USGS-R/dataRetrieval", dependencies=TRUE)
Finally, use the library() function to load the TADA R Package into your R session.
Help pages
All TADA R package functions have their own individual help pages,
listed on the Function
reference page on the GitHub site. Users can also access the help
page for a given function in R or RStudio using the following format
(example below): ?[name of TADA function]
# Access help page for TADA_DataRetrieval
?TADA_DataRetrieval
Retrieve WQP data
WQP data is retrieved and processed for compatibility with TADA. This function, TADA_DataRetrieval, builds on USGS’s dataRetrieval R package functions. It joins three WQP profiles: Site, Sample Results (physical/chemical metadata), and Project. In addition, when the function input applyautoclean = TRUE, it changes all data in the characteristic, speciation, fraction, and unit fields to uppercase and addresses result values that include special characters.
Query filters are consistent with those available from Water Quality Portal (WQP) and can include:
startDate
endDate
characteristicName
sampleMedia
siteType
statecode (review list of possible state and territory abbreviations)
countycode
siteid
organization
project
huc
characteristicType
Tips:
-
All the query filters for the WQP work as an AND but within the fields there are ORs. For example:
Characteristics: If you choose pH & DO - it’s an OR. You will retrieve both pH OR DO data if available.
States: Similarly, if you choose MI and IL, it’s an OR. You will retrieve both MI OR IL data if available.
Combinations of fields are ANDs, such as State/MI AND Characteristic/DO. This means you will receive all DO data available in MI.
Characteristic and Characteristic Type also work as an AND. This means that the Characteristic must fall within the CharacteristicGroup if both filters are being used, if not you will get an error.
The “siteid” is a general term WQP uses to describe both Site IDs from USGS databases and Monitoring Location Identifiers (from WQX). Each monitoring location in the Water Quality Portal (WQP) has a unique Monitoring Location Identifier, regardless of the database from which it derives.
The example dataset in this demo includes monitoring data from multiple organizations collected during one week (May 1, 2019 - May, 7 2019) in Region 5 states (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin).
For demonstration purposes, the R5 example data set has already been downloaded. The default TADA_DataRetrieval function automatically runs the TADA_AutoClean function. In this example, we set TADA_AutoClean = FALSE and run it as a separate step in the work flow. You can run the code chunk below to access the pre-downloaded data set. The original query has also been included and can be accessed by uncommenting and running the TADA_DataRetrieval function below.
Downloads using TADA_DataRetrieval will have the same columns each time. Any columns with a “TADA” prefix were added during data retrieval. TADA_DataRetrieval automatically converts the date times to UTC and data to dates, datetimes, and numerics based on a standard algorithm. Data are uploaded to the Water Quality Portal by individual organizations, which may or may not follow the same conventions. Data and metadata quality are not guaranteed! Carefully explore any data and make conservative quality assurance decisions where information is limited.
Note: USGS and EPA are working together to create WQP 3.0 data profiles. Once released, one data profile will contain the columns critical to TADA, removing the need to combine profiles. TADA package users likely will not notice a difference in their usage of the TADA_DataRetrieval function, but it will simplify the steps needed to upload a custom or WQP GUI-downloaded data set into the R package.
# Uncomment query below to download data set from WQP
# TADAProfile <- TADA_DataRetrieval(statecode = c("IL", "IN", "MI", "MN", "OH", "WI"), startDate = "2019-05-01", endDate = "2019-05-07", applyautoclean = FALSE)
# For demo purposes, import pre-downloaded R5 data set
R5Profile <- Data_R5_TADAPackageDemo
R5 Profile
The table above shows the first 100 records from the R5 Water Quality Portal Query.
Question 1: How many results did TADA_DataRetrieval return?
# Determine number of rows in the data set
R5_nresults <- nrow(R5Profile)
If you need to download a large amount of data from across a large area, and the TADA_DataRetrieval function is not working due to WQP timeout issues, then the TADA_BigDataRetrieval function may work better. See ?TADA_BigDataRetrieval for more details or the TADA Module 1 vignette for more details.
Review and Filter By Media Type
TADA is primarily designed to accommodate water data from the WQP. Let’s see what activity media types are represented in the data set.
To help answer this question, we can use TADA functions to create a table or pie chart of the counts of unique values in a user-specified field with TADA_FieldValuesTable or TADA_FieldValuesPie. We’ll start with TADA.ActivityMediaName.
# Create pie chart for ActivityMediaName
ActMedName_Pie <- TADA_FieldValuesPie(R5Profile, field = "ActivityMediaName")
## Error in get(paste0(generic, ".", class), envir = get_method_env()) :
## object 'type_sum.accel' not found
ActMedName_Pie
# Create table with count for each ActivityMediaName
FieldValues_ActMedia <- TADA_FieldValuesTable(R5Profile, field = "ActivityMediaName")
Question 2: How many unique ‘ActivityMediaName’ values exist in your dataset? Are there any media types that are not water?
n_media <- length(unique(FieldValues_ActMedia$Value))
unique_media <- FieldValues_ActMedia %>%
dplyr::select(Value) %>%
dplyr::mutate(media = paste(Value, collapse = ", ")) %>%
dplyr::select(media) %>%
unique() %>%
stringi::stri_replace_last(fixed = ",", " and")
n_not_water <- FieldValues_ActMedia %>%
dplyr::filter(Value != "Water") %>%
dplyr::summarize(n_not_water = length(Value))
Filter for Surface Water Data
Some TADA users are interested in using WQP data for surface water only or including some non-water data. The TADA_AnalysisDataFilter identifies surface water, groundwater, and sediment results. Multiple columns are used to identify groundwater results as organizations may populate different combinations of fields in order to identify groundwater results.
Users can specify whether all results should be returned with a new column, TADA.UseForAnalysis.Flag, providing information about the result’s media type and identifying if the result should be included in further analysis, or if only results that should be included are returned.
The defaults are to include surface water, exclude groundwater and sediment, and to return all results with the added TADA.UseForAnalysis.Flag column (clean = FALSE). This is shown in the example below.
We can use TADA_FieldValuesTable and TADA_FieldValuesPie again to review this more specific breakdown of results by media type.
Note: When WQX 3.0 data profiles are released, this function will be updated to include fish tissue a filter option.
# Filter to flag only surface water results for use in analysis
R5Profile <- TADA_AnalysisDataFilter(R5Profile, clean = FALSE)
## [1] "TADA_AnalysisDataFilter: Identifying groundwater results."
## [1] "TADA_AnalysisDataFilter: Flagging surface water results to include in assessments."
## [1] "TADA_AnalysisDataFilter: Flagging groundwater results to exclude from assessments."
## [1] "TADA_AnalysisDataFilter: Flagging sediment results to exclude from assessments."
## [1] "TADA_AnalysisDataFilter: Returning all results with TADA.UseForAnalysis.Flag column indicating if result should be used for assessments."
# Create pie chart for TADA.UseForAnalysis.Flag
FieldVal_Pie <- TADA_FieldValuesPie(R5Profile, field = "TADA.UseForAnalysis.Flag")
FieldVal_Pie
FieldValues_AnalysisFlag <- TADA_FieldValuesTable(R5Profile, field = "TADA.UseForAnalysis.Flag")
Question 3: How many SURFACE WATER results can we retain for further analysis
AutoClean
Now TADA_AutoClean can be run on the surface water-only data set. It performs the following functions on the data retrieved from the WQP:
TADA_ConvertSpecialChars - converts result value columns to numeric and flags non-numeric values that could not be converted.
TADA_ConvertResultUnits - unifies result units for easier quality control and review
TADA_ConvertDepthUnits - converts depth units to a consistent unit (meters). This function can also be run on its own, for example if you would like to convert depths from meters to feet later in the workflow.
TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.
Other helpful actions - converts important text columns to all upper-case letters, removes exact duplicates, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” The following columns are numeric versions of their WQP origins: TADA.ResultMeasureValue, TADA.DetectionQuantitationLimitMeasure.MeasureValue, TADA.LatitudeMeasure, and TADA.LongitudeMeasure.
These functions also add the columns TADA.ResultMeasureValueDataTypes.Flag and TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag, which provide information about the result values that is needed to address censored data later on. These new columns flag if special characters are included in result values, and specify what the special characters are.
# run TADA_AutoClean on filtered dataset to convert special characters, result units, and depth units and identify censored data.
R5Profile <- TADA_AutoClean(R5Profile)
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "518 results in your dataset have one of the following deprecated characteristic names: 1,2-Benzenedicarboxamide, N2-[1,1-dimethyl-2-(methylsulfonyl)ethyl]-3-iodo-N1-[2-methyl-4-[1,2,2,2-tetrafluoro-1-(trifluoromethyl)ethyl]phenyl]-; Desisopropyl atrazine; Inorganic nitrogen (nitrate and nitrite); Inorganic nitrogen (nitrate and nitrite) ***retired***use Nitrate + Nitrite; Nutrient-nitrogen***retired***use TOTAL NITROGEN, MIXED FORMS with speciation AS N. These names have been substituted with the updated preferred names in the TADA.CharacteristicName field."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"
Monitoring Location Review
Now let’s take a look at a breakdown of these monitoring location types. The previous function removed non-surface water results. Depending upon your program’s goals and methods, you might want to further filter the monitoring location types in the data set.
MonLocTypNam_Pie <- TADA_FieldValuesPie(R5Profile, field = "MonitoringLocationTypeName")
MonLocTypNam_Pie
Question 4: How many unique Monitoring Location Types (MonitoringLocationTypeName) are present? Which is the most common?
FieldValues_MLs_table <- TADA_FieldValuesTable(R5Profile, field = "MonitoringLocationTypeName")
mlt_n <- length(unique(FieldValues_MLs_table$Value))
mlt_most_common <- FieldValues_MLs_table %>%
dplyr::slice_max(Count)
Now, let’s filter by mlt_most_common and focus the rest of this demo on the mlt_most_common[1] subset of results.
R5Profile <- R5Profile %>%
dplyr::filter(MonitoringLocationTypeName == dplyr::pull(mlt_most_common[1]))
Next we can review station locations and summary information using the TADA_OverviewMap function. TADA_OverviewMap counts the number of unique results, characteristics, and organizations at each monitoring location and creates a map for reviewing summary stats. Larger point sizes indicate more results, while darker blue colors indicate more unique characteristics collected at the site. Users may click on a site to view a pop-up with this summary information. This can be a good first step in identifying incorrect coordinates that are outside of the desired state(s) of interest or outside of the US.
TADA_OverviewMap(R5Profile)
Another function, TADA_FlagCoordinates, identifies and flags potentially invalid coordinate data. Its functionality will not be showcased in this demo due to time constraints. For details see TADA: Module 1 or enter ?TADA_FlagCoordinates in the R console for more information and examples.
Summarize by Characteristic
TADA_SummarizeColumn summarizes the data set based on a user specified column and returns a data frame displaying the number of sites and number of records for each unique value in the specified column. This example uses TADA.CharacteristicName.
# Review the number of sites and number of records for each CharacteristicName in TADAProfile
R5Profile_CharSummary <- TADA_SummarizeColumn(R5Profile, "TADA.CharacteristicName")
Question 5: How many unique values of TADA.CharacteristicName are included in the data set? Which TADA.CharacteristicName was collected at the greatest number of sites? Which TADA.CharacteristicName has the most results?
# Number of unique values of TADA.CharacteristicName
n_char_name <- length(unique(R5Profile_CharSummary$TADA.CharacteristicName))
# Filter for record with max number of sites per Character
max_sites <- R5Profile_CharSummary %>%
dplyr::slice_max(n_sites)
# Filter for record with max number of results per Character
max_results <- R5Profile_CharSummary %>%
dplyr::slice_max(n_records)
WQX Quality Assurance and Quality Control (QAQC) Service Result Flags
Next we’ll run several functions to address suspect method, fraction, speciation, and unit metadata by characteristic. The default is clean = TRUE, which will remove suspect results. You can change this to clean = FALSE to flag results, but not remove them.
See documentation for more details:
?TADA_FlagMethod
?TADA_FlagSpeciation
?TADA_FlagResultUnit
?TADA_FlagFraction
R5ProfileClean1 <- TADA_FlagMethod(R5Profile, clean = TRUE)
R5ProfileClean1 <- TADA_FlagFraction(R5ProfileClean1, clean = TRUE)
R5ProfileClean1 <- TADA_FlagSpeciation(R5ProfileClean1, clean = "suspect_only")
R5ProfileClean1 <- TADA_FlagResultUnit(R5ProfileClean1, clean = "suspect_only")
Question 6: How many rows were removed due to the WQX QAQC service functions? What might be a helpful first step in investigating why these removals occurred?
WQX national upper and lower thresholds
Run the following code to flag or remove results that are above or below the national upper and lower bound for each characteristic and unit combination. See documentation for more details:
?TADA_FlagAboveThreshold
?TADA_FlagBelowThreshold
R5ProfileClean2 <- TADA_FlagAboveThreshold(R5ProfileClean1, clean = TRUE)
R5ProfileClean3 <- TADA_FlagBelowThreshold(R5ProfileClean2, clean = TRUE)
## [1] "No data below the WQX Lower Threshold were found in your dataframe. Returning the input dataframe with TADA.ResultValueBelowLowerThreshold.Flag column for tracking."
Question 7: How many results were removed due to the WQX upper and lower thresholds?
Potential duplicates
Sometimes multiple organizations submit the exact same data to Water Quality Portal (WQP). Let’s start by looking at the submitting organizations for this data set and their relative contributions.
# Create data frame with result counts by organization
FieldValues_Orgs <- TADA_FieldValuesTable(R5ProfileClean3, field = "OrganizationFormalName")
Question 8: How many organizations submitted data? Which organization submitted the most results? Which organization submitted the fewest results?
orgs_n <- length(unique(FieldValues_Orgs$Value))
org_max <- dplyr::slice_max(FieldValues_Orgs, Count)
org_min <- dplyr::slice_min(FieldValues_Orgs, Count)
Organizations occasionally submit the same data multiple times to the Portal. The following functions check for and identify data that may be duplicates based on date, time, characteristic, result value, and a distance buffer. Each pair or group of potential duplicate rows is flagged with a unique ID. For more information, review the documentation by entering the following into the console:
- ?TADA_FindPotentialDuplicatesMultipleOrgs
- ?TADA_FindPotentialDuplicatesSingleOrg
# Flag potential duplicates from multiple organizations
R5ProfileClean3 <- TADA_FindPotentialDuplicatesMultipleOrgs(R5ProfileClean3,
dist_buffer = 100,
org_hierarchy = "none"
)
## [1] "84 potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.MultipleOrgDupGroupID column and the TADA.MultipleOrgDuplicate column is set to 'Y' (yes). If you provided an organization hierarchy, the result with the lowest ranked organization identifier was selected as the representative result in the TADA.ResultSelectedMultipleOrgs (this column is set to 'Y' for all results either selected or not considered duplicates)."
# Flag potential duplicates from a single organization
R5ProfileClean3 <- TADA_FindPotentialDuplicatesSingleOrg(R5ProfileClean3)
## [1] "38 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
# Remove multiple and single org duplicates (random selection of single result)
R5ProfileClean4 <- R5ProfileClean3 %>%
dplyr::filter(
TADA.ResultSelectedMultipleOrgs == "Y",
TADA.SingleOrgDup.Flag == "Unique"
)
Question 9: How many total duplicate results were removed? How could you prioritize retaining results from your organization when duplicate results are identified?
Full Dataframe Filtering
In this section a TADA user can review the unique values in specific fields and may choose to remove data with particular values.
To start, review the list of common fields used for filtering, and the number of unique values in each field using the TADA_FieldCounts function.
This function returns counts for you the entire data frame for each of the following fields (if populated, columns that are populated only with NA’s are not included in the output):
ActivityTypeCode
TADA.ActivityMediaName
ActivityMediaSubdivisionName
ActivityCommentText
MonitoringLocationTypeName
StateCode
OrganizationFormalName
TADA.CharacteristicName
HydrologicCondition
HydrologicEvent
BiologicalIntentName
MeasureQualifierCode
ActivityGroup
AssemblageSampledName
ProjectName
CharacteristicNameUserSupplied
DetectionQuantitationLimitTypeName
SampleTissueAnatomyName
LaboratoryName
# Create table of field counts
FieldCounts_Table_R5 <- TADA_FieldCounts(R5ProfileClean4)
The ActivityTypeCode field has multiple unique values. Before we remove the QC samples/measurements from this dataset to prepare for analyses, lets review flagged Quality Control (QC) values using the TADA_FindQCActivities function, which a TADA.ActivityType.Flag column.
For example, the new QC_replicate flag in TADA.ActivityType.Flag column indicates that the flagged rows include any of the following replicate values: Quality Control Field Replicate Habitat Assessment, Quality Control Field Replicate Msr/Obs, Quality Control Field Replicate Portable Data Logger, Quality Control Field Replicate Sample-Composite and Quality Control Sample-Field Replicate.
See WQX domain file to review all the ActivityTypeCode allowable values: https://cdx.epa.gov/wqx/download/DomainValues/ActivityType.CSV
Other functions, such as TADA_PairReplicates which identifies replicates and their original (parent) samples, may also be useful when reviewing QC results. For now, let’s simply remove QC samples/measurements from the dataframe.
For more information on TADA_PairReplicates and suggestions for handling paired replicates (outside the TADA package) see TADA Module 1 or enter “?TADA_PairReplicates” in the console.
# Remove flagged QC samples using the TADA_FindQCActivities function:
R5ProfileClean5 <- TADA_FindQCActivities(R5ProfileClean4,
clean = TRUE
)
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
# Generate pie chart for ActivityTypeCode
ActTypCod_Pie <- TADA_FieldValuesPie(R5ProfileClean5, "ActivityTypeCode")
ActTypCod_Pie
Question 10: How many ActivityTypeCodes are present after removing QC samples? Which ActivityTypeCode is the most common?
# Number of ActivityTypeCodes
activity_type_code_n <- length(unique(R5ProfileClean5$ActivityTypeCode))
# Most common Activity Type Code
most_common_activity <- TADA_FieldValuesTable(R5ProfileClean5, "ActivityTypeCode") %>%
dplyr::slice_max(Count)
We’ve completed our review of the ActivityTypeCode. Now, let’s move on to a different field and see if there are any values that we want to remove. In this next example, there are multiple MeasureQualifierCode values to review.
MQC_Pie <- TADA_FieldValuesPie(R5ProfileClean5, "MeasureQualifierCode")
MQC_Pie
MeasureQualifierCode definitions are available here.
In this example, we show how to use the function TADA_FlagMeasureQualifierCode to add MeasureQualifierCode definitions and flag and/or remove rows with specific codes under MeasureQualifierCode that are categorized as “SUSPECT”.
See ?TADA_FlagMeasureQualifierCode for more information.
# Flag only
Review_R5ProfileClean5 <- TADA_FlagMeasureQualifierCode(R5ProfileClean5,
clean = FALSE,
flaggedonly = TRUE,
define = TRUE
)
# Review number of suspect results
suspect_n <- Review_R5ProfileClean5 %>%
dplyr::filter(TADA.MeasureQualifierCode.Flag == "SUSPECT") %>%
nrow()
# Run function with clean = TRUE
R5ProfileClean5 <- TADA_FlagMeasureQualifierCode(R5ProfileClean5,
clean = TRUE
)
Question 11: How many results were identified as “SUSPECT” based on the MeasureQualifierCode?
Censored data
Censored data are measurements for which the true value is not known, but we can estimate the value based on lower or upper detection conditions and limit types. TADA fills missing TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode values with values and units from TADA.DetectionQuantitationLimitMeasure.MeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode, respectively, using the TADA_AutoClean function. In other words, detection limit information is copied and pasted into the result value column when the original value is NA and detection limit information is available. The two columns TADA focuses on to define and flag censored data are ResultDetectionConditionText and DetectionQuantitationLimitTypeName.
The TADA package currently has functions that summarize censored data incidence in the dataset and perform simple substitutions of censored data values, including x times the detection limit and random selection of a value between 0 and the detection limit. The user may specify the methods used for non-detects and over-detects separately in the input to the TADA_SimpleCensoredMethods function.
All censored data functions depend first on the TADA_IDCensoredData utility function, which assigns a TADA.CensoredData.Flag to all data records and identifies over-detects from non-detects using the ResultDetectionConditionText and DetectionQuantitationLimitTypeName. This utility function is automatically run within the TADA_AutoClean function and produces the TADA.CensoredData.Flag column. All records receive one of the following classifications: Uncensored; Non-Detect; Over-Detect; Other Condition/Limit Populated; Conflict between Condition and Limit; Detection condition or detection limit is not documented in TADA; or Detection condition is missing and required for censored data ID.
The TADA_SimpleCensoredMethods function also adds a TADA.MeasureQualifierCode.Def column which contains the MeasureQualiferCode concatenated with the WQX definition for each qualifier code. This provides additional information to the user which may assist in deciding which records to retain for analysis.
The next step we take in this example is to perform simple conversions to the censored data in the dataset: we keep over-detects as is (no conversion made) and convert non-detect values to 0.5 times the detection limit (half the detection limit). Please review ?TADA_Stats and ?TADA_SimpleCensoredMethods for more information.
R5ProfileClean5 <- TADA_SimpleCensoredMethods(R5ProfileClean5,
nd_method = "multiplier",
nd_multiplier = 0.5,
od_method = "as-is",
od_multiplier = "null"
)
## [1] "TADA_IDCensoredData: 18 records in supplied dataset have conflicting detection condition and detection limit type information. These records will not be included in detection limit handling calculations."
Question 12: How many NAs in the TADA.ResultMeasureValue column remain after running TADA_SimpleCensoredMethods? What is the difference between the values in the MeasureQualifierCode and TADA.MeasureQualifierCode.Def columns?
# Determine number of NA results remaining
result_na_n <- sum(is.na(R5ProfileClean5$TADA.ResultMeasureValue))
# Show unique values of MeasureQualifierCode and TADA.MeasureQualifierCode.Def
mqc <- R5ProfileClean5 %>%
dplyr::select(MeasureQualifierCode, TADA.MeasureQualifierCode.Def) %>%
unique()
Next, we need to filter down to only numeric data. Remove data where the TADA.ResultMeasureValueDataTypes.Flag = “Text” or “NA - Not Available”. You can also remove any columns not required for the TADA workflow that contain only NAs. The TADA_AutoFilter function removes non-numeric data and optional columns containing only NAs.
# Removes rows where the result value is not numeric. Specifically, removes rows with "Text" or "NA - Not Available" in the TADA.ResultMeasureValueDataTypes.Flag column, or NA in the TADA.ResultMeasureValue column. Removes optional columns containing only NAs.
R5ProfileClean6 <- TADA_AutoFilter(R5ProfileClean5)
## [1] "TADA_Autofilter: removing columns not required for TADA workflow if they contain only NAs."
## [1] "The following column(s) were removed as they contained only NAs: ActivityDepthAltitudeReferencePointText, SampleAquifer, ResultParticleSizeBasisText, USGSPCode, BinaryObjectFileName, BinaryObjectFileTypeCode, LabSamplePreparationUrl, HorizontalAccuracyMeasure.MeasureValue, HorizontalAccuracyMeasure.MeasureUnitCode, VerticalAccuracyMeasure.MeasureValue, VerticalAccuracyMeasure.MeasureUnitCode, FormationTypeText, ProjectMonitoringLocationWeightingUrl, DrainageAreaMeasure.MeasureValue, DrainageAreaMeasure.MeasureUnitCode, ContributingDrainageAreaMeasure.MeasureValue and ContributingDrainageAreaMeasure.MeasureUnitCode."
## [1] "TADA_Autofilter: checking required columns for non-NA values."
## [1] "TADA Required column(s) SubjectTaxonomicName, SampleTissueAnatomyName, ResultDepthHeightMeasure.MeasureValue, TADA.ResultDepthHeightMeasure.MeasureValue, ResultDepthHeightMeasure.MeasureUnitCode, TADA.ResultDepthHeightMeasure.MeasureUnitCode, ResultDepthAltitudeReferencePointText, ActivityTopDepthHeightMeasure.MeasureValue, TADA.ActivityTopDepthHeightMeasure.MeasureValue, ActivityTopDepthHeightMeasure.MeasureUnitCode, TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode, ActivityBottomDepthHeightMeasure.MeasureValue, TADA.ActivityBottomDepthHeightMeasure.MeasureValue, ActivityBottomDepthHeightMeasure.MeasureUnitCode, TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode, ResultFileUrl, ResultAnalyticalMethod.MethodUrl, HydrologicCondition, HydrologicEvent, DataQuality.PrecisionValue, DataQuality.BiasValue, DataQuality.ConfidenceIntervalValue, DataQuality.UpperConfidenceLimitValue, DataQuality.LowerConfidenceLimitValue, ResultLaboratoryCommentText, AquiferName, AquiferTypeName, LocalAqfrName, ConstructionDateText, WellDepthMeasure.MeasureValue, WellDepthMeasure.MeasureUnitCode, WellHoleDepthMeasure.MeasureValue and WellHoleDepthMeasure.MeasureUnitCode contain only NA values. This may impact other TADA functions."
## [1] "Function removed 1374 results. These results are either text or NA and cannot be plotted or represent quality control activities (not routine samples or measurements)."
Question 13: How many NAs remain in the data set for TADA.ResultMeasureValue after running TADA_Autofilter? How many non-numeric values in TADA.ResultMeasureValueDataTypes.Flag remain in the data set? How many rows were removed from the data set?
result_nas_n <- sum(is.na(R5ProfileClean6$TADA.ResultMeasureValue))
non_numeric_n <- sum(!R5ProfileClean6$TADA.ResultMeasureValueDataTypes.Flag %in% c("Numeric", "Result Value/Unit Estimated from Detection Limit", "Result Value/Unit Copied from Detection Limit"))
R5ProfileClean5_n <- nrow(R5ProfileClean5)
R5ProfileClean6_n <- nrow(R5ProfileClean6)
autofilter_removed_n <- R5ProfileClean5_n - R5ProfileClean6_n
Convert synonymous characteristic, fraction, speciation, and unit values to a consistent convention based on user-defined/TADA standards
The TADA_GetSynonymRef function generates a synonym reference table that is specific to the input dataframe. Users can review how their input data relates to standard TADA values for the following elements:
TADA.CharacteristicName
TADA.ResultSampleFractionText
TADA.MethodSpeciationName
TADA.ResultMeasure.MeasureUnitCode
Users can also edit the reference file to meet their needs if desired. The download argument can be used to save the harmonization file to your current working directory when download = TRUE, the default is download = FALSE.
The TADA_HarmonizeSynonyms function then compares the input dataframe to the TADA Synonym Reference Table and makes conversions where target characteristics/fractions/speciations/units are provided. This function also appends a column called TADA.Harmonized.Flag, indicating which results had metadata changed/converted in this function. The purpose of this function is to make similar data consistent and therefore easier to compare and analyze.
Here are some examples of how the TADA_HarmonizeSynonyms can be useful:
TADA.ResultSampleFractionText specifies forms of constituents. In some cases, a single TADA.CharacteristicName will have both “Total” and “Dissolved” forms specified, which should not be combined. In these cases, each TADA.CharacteristicName and TADA.ResultSampleFractionText combination is given a different identifier. This identifier can be used later on to identify comparable data groups (TADA.ComparableDataIdentifier) for calculating statistics and creating figures for each combination.
Some variables have different names but represent the same constituent (e.g., “Total Kjeldahl nitrogen (Organic N & NH3)” and “Kjeldahl nitrogen”). The TADA_HarmonizeSynonyms function gives a consistent name (and identifier) to synonyms.
# Get harmonization reference table
UniqueHarmonizationRef <- TADA_GetSynonymRef(R5ProfileClean6)
# Filter out records without target to display in table
UniqueHarmonizationRef_no_na <- UniqueHarmonizationRef %>%
dplyr::filter(is.na(Target.TADA.CharacteristicName))
# Harominze data set
R5ProfileClean7 <- TADA_HarmonizeSynonyms(R5ProfileClean6,
ref = UniqueHarmonizationRef
)
The table above shows the first 100 rows of the UniqueHarmonizationRef table.
Total Nitrogen and Total Phosphorus Calculations
This section covers summing nutrient subspecies to estimate total nitrogen and phosphorus. This can be challenging because some subspecies/compounds overlap in total nutrient calculations. Thus, TADA_CalculateTotalNP uses the Nutrient Aggregation logic to add subspecies to obtain a total. TADA adds one more equation to the mix: total particulate nitrogen + total dissolved nitrogen. The function uses as many subspecies as possible to calculate a total for each given site, date, and depth group, but it will estimate total nitrogen with whatever subspecies are present. This function creates NEW total nutrient measurements (total nitrogen unfiltered as N and total phosphorus unfiltered as P) and adds them to the dataframe.
Users can use the default summation worksheet (see TADA_GetNutrientSummationRef) or customize it. The function also requires a daily aggregation value, either minimum, maximum, or mean. The default is ‘max’, which means that if multiple measurements of the same subspecies-fraction-speciation-unit occur on the same day at the same site and depth, the function will pick the maximum value to use for summation.
R5ProfileClean8 <- TADA_CalculateTotalNP(R5ProfileClean7, daily_agg = "max")
## [1] "Aggregation results:"
##
## No aggregation needed Selected as max aggregate value
## 6565 35
Question 14: How many new Total Nitrogen or Total Phosphorus rows were added? Can users customize nutrient summation?
R5ProfileClean7_n <- nrow(R5ProfileClean7)
R5ProfileClean8_n <- nrow(R5ProfileClean8)
nutrient_n <- R5ProfileClean8_n - R5ProfileClean7_n
total_N_P_subset <- R5ProfileClean8 %>%
dplyr::filter(TADA.NutrientSummation.Flag %in% c("Nutrient summation from one or more subspecies.", "Nutrient summation from one subspecies.")) %>%
dplyr::select(TADA.CharacteristicName, TADA.ResultMeasureValue, TADA.ResultMeasure.MeasureUnitCode, TADA.NutrientSummation.Flag, TADA.NutrientSummationEquation, TADA.NutrientSummationGroup)
The table above shows the first 100 rows added by calculating Total N and P. Columns included in the table are TADA.CharacteristicName, TADA.ResultMeasureValue, TADA.ResultMeasure.MeasureUnitCode, TADA.NutrientSummation.Flag, TADA.NutrientSummationEquation, and TADA.NutrientSummationGroup.
Parameter Level Filtering
In this section, you can select a single parameter, and review the unique values in specified fields. You may then choose to remove particular values by filtering.
To start, review the list of parameters in the dataframe using the TADA_FieldValuesTable function.
Char_Pie <- FieldValuesTable_Chars <- TADA_FieldValuesTable(R5ProfileClean8, field = "TADA.CharacteristicName")
Char_Pie
Next, we can revisit the TADA_FieldCounts function at the characteristic level to review how many unique allowable values are included within each of the following fields:
ActivityCommentText
ActivityTypeCode
TADA.ActivityMediaName
ActivityMediaSubdivisionName
MeasureQualifierCode
MonitoringLocationTypeName
HydrologicCondition
HydrologicEvent
ResultStatusIdentifier
MethodQualifierTypeName
ResultCommentText
ResultLaboratoryCommentText
TADA.ResultMeasure.MeasureUnitCode
TADA.ResultSampleFractionText
ResultTemperatureBasisText
ResultValueTypeName
ResultWeightBasisText
SampleCollectionEquipmentName
LaboratoryName
MethodDescriptionText
ResultParticleSizeBasisText
SampleCollectionMethod.MethodIdentifier
SampleCollectionMethod.MethodIdentifierContext
SampleCollectionMethod.MethodName
DataQuality.BiasValue
MethodSpeciationName
ResultAnalyticalMethod.MethodName
ResultAnalyticalMethod.MethodIdentifier
ResultAnalyticalMethod.MethodIdentifierContext
AssemblageSampledName
DetectionQuantitationLimitTypeName
R5_FieldCounts_Chars <- TADA_FieldCounts(R5ProfileClean8, display = "most", characteristicName = "DISSOLVED OXYGEN (DO)")
Selecting a parameter generates the list above, which is subset by the selected parameter. The list includes fields you may want to review, and the number of unique values in each field.
Next, choose a field from the list.
Review the WQX domain files for definitions: https://www.epa.gov/waterdata/storage-and-retrieval-and-water-quality-exchange-domain-services-and-downloads
Now, we’ll use TADA_FieldValuesTable and TADA_FieldValuesPie at the characteristic-level to review a column of interest. In this example we review values from the SampleCollectionMethod.MethodName field for Dissolved Oxygen (DO) results.
# C Create pie chart for SampleCollectionMethod.MethodName for Dissolved Oxygen (DO results)
DO_SCM_Pie <- TADA_FieldValuesPie(R5ProfileClean8, field = "SampleCollectionMethod.MethodName", characteristicName = "DISSOLVED OXYGEN (DO)")
DO_SCM_Pie
# Create table for SampleCollectionMethod.MethodName for Dissolved Oxygen (DO results)
FieldValuesTable_DO_scm <- TADA_FieldValuesTable(R5ProfileClean8, field = "SampleCollectionMethod.MethodName", characteristicName = "DISSOLVED OXYGEN (DO)")
Generate a scatterplot with two characteristics.
# review unique identifiers
unique(R5ProfileClean8$TADA.ComparableDataIdentifier)
## [1] "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L"
## [2] "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"
## [3] "TOTAL SUSPENDED SOLIDS_NON-FILTERABLE (PARTICLE)_NA_UG/L"
## [4] "SPECIFIC CONDUCTANCE_NA_NA_US/CM"
## [5] "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"
## [6] "DISSOLVED OXYGEN SATURATION_NA_NA_%"
## [7] "TEMPERATURE_NA_NA_DEG C"
## [8] "PH_NA_NA_NA"
## [9] "TRANSPARENCY, TUBE WITH DISK_NA_NA_IN"
## [10] "DEPTH, BOTTOM_NA_NA_IN"
## [11] "CHLOROPHYLL A, FREE OF PHEOPHYTIN_NON-FILTERABLE (PARTICLE)_NA_UG/L"
## [12] "AMMONIUM_FILTERABLE_AS N_MG/L"
## [13] "NITRATE + NITRITE_FILTERED_AS N_MG/L"
## [14] "TRANSPARENCY, TUBE WITH DISK_TOTAL_NA_IN"
## [15] "FLOW_NA_NA_CFS"
## [16] "ESCHERICHIA COLI_NA_NA_CFU/100ML"
## [17] "PHOSPHATE_UNFILTERED_AS P_UG/L"
## [18] "CLOUD COVER_NA_NA_%"
## [19] "TURBIDITY_NA_NA_NTU"
## [20] "ORTHOPHOSPHATE_FILTERED_AS P_UG/L"
## [21] "TOTAL KJELDAHL NITROGEN (ORGANIC N & NH3)_UNFILTERED_AS N_MG/L"
## [22] "SUSPENDED SEDIMENT CONCENTRATION (SSC)_TOTAL_NA_UG/L"
## [23] "PHOSPHATE_FILTERED_AS P_UG/L"
## [24] "AMMONIA_FILTERED_AS N_MG/L"
## [25] "CHLOROPHYLL A_UNFILTERED_NA_UG/L"
## [26] "CHLORIDE_TOTAL_NA_UG/L"
## [27] "CALCIUM CARBONATE_TOTAL_NA_UG/L"
## [28] "POTASSIUM_TOTAL_NA_UG/L"
## [29] "SODIUM_TOTAL_NA_UG/L"
## [30] "SULFATE_TOTAL_NA_UG/L"
## [31] "MAGNESIUM_TOTAL_NA_UG/L"
## [32] "CALCIUM_TOTAL_NA_UG/L"
## [33] "STREAM STAGE_NA_NA_M"
## [34] "CONDUCTIVITY_NA_NA_US/CM"
## [35] "TRANSPARENCY, SECCHI TUBE WITH DISK_NA_NA_IN"
## [36] "VELOCITY - STREAM_NA_NA_M/SEC"
## [37] "OXIDATION REDUCTION POTENTIAL (ORP)_NA_NA_VOLTS"
## [38] "SALINITY_NA_NA_PSS"
## [39] "FECAL COLIFORM_NA_NA_CFU/100ML"
## [40] "TEMPERATURE, SAMPLE_NA_NA_DEG C"
## [41] "CHLOROPHYLL A, CORRECTED FOR PHEOPHYTIN_TOTAL_NA_UG/L"
## [42] "CHLOROPHYLL A, UNCORRECTED FOR PHEOPHYTIN_TOTAL_NA_UG/L"
## [43] "CHLOROPHYLL B_TOTAL_NA_UG/L"
## [44] "CHLOROPHYLL C_TOTAL_NA_UG/L"
## [45] "PHEOPHYTIN A_TOTAL_NA_UG/L"
## [46] "ORGANIC CARBON_TOTAL_NA_UG/L"
## [47] "AMMONIA_UNFILTERED_AS N_MG/L"
## [48] "NITRATE + NITRITE_UNFILTERED_AS N_MG/L"
## [49] "PHOSPHORUS_DISSOLVED_NA_UG/L"
## [50] "PHENOLS_TOTAL_NA_UG/L"
## [51] "HARDNESS, CA, MG_NA_NA_MG/L"
## [52] "LEAD_TOTAL_NA_UG/L"
## [53] "MANGANESE_TOTAL_NA_UG/L"
## [54] "NICKEL_TOTAL_NA_UG/L"
## [55] "SELENIUM_TOTAL_NA_UG/L"
## [56] "SILVER_TOTAL_NA_UG/L"
## [57] "STRONTIUM_TOTAL_NA_UG/L"
## [58] "VANADIUM_TOTAL_NA_UG/L"
## [59] "ZINC_TOTAL_NA_UG/L"
## [60] "ARSENIC_TOTAL_NA_UG/L"
## [61] "BERYLLIUM_TOTAL_NA_UG/L"
## [62] "BORON_TOTAL_NA_UG/L"
## [63] "CADMIUM_TOTAL_NA_UG/L"
## [64] "CHROMIUM_TOTAL_NA_UG/L"
## [65] "COBALT_TOTAL_NA_UG/L"
## [66] "COPPER_TOTAL_NA_UG/L"
## [67] "ALUMINUM_DISSOLVED_NA_UG/L"
## [68] "ARSENIC_DISSOLVED_NA_UG/L"
## [69] "BARIUM_DISSOLVED_NA_UG/L"
## [70] "BERYLLIUM_DISSOLVED_NA_UG/L"
## [71] "BORON_DISSOLVED_NA_UG/L"
## [72] "CADMIUM_DISSOLVED_NA_UG/L"
## [73] "CALCIUM_DISSOLVED_NA_UG/L"
## [74] "CHROMIUM_DISSOLVED_NA_UG/L"
## [75] "COBALT_DISSOLVED_NA_UG/L"
## [76] "COPPER_DISSOLVED_NA_UG/L"
## [77] "IRON_DISSOLVED_NA_UG/L"
## [78] "LEAD_DISSOLVED_NA_UG/L"
## [79] "MAGNESIUM_DISSOLVED_NA_UG/L"
## [80] "MANGANESE_DISSOLVED_NA_UG/L"
## [81] "NICKEL_DISSOLVED_NA_UG/L"
## [82] "POTASSIUM_DISSOLVED_NA_UG/L"
## [83] "SELENIUM_DISSOLVED_NA_UG/L"
## [84] "SILVER_DISSOLVED_NA_UG/L"
## [85] "SODIUM_DISSOLVED_NA_UG/L"
## [86] "STRONTIUM_DISSOLVED_NA_UG/L"
## [87] "VANADIUM_DISSOLVED_NA_UG/L"
## [88] "ZINC_DISSOLVED_NA_UG/L"
## [89] "CYANIDE, AVAILABLE_NA_NA_UG/L"
## [90] "FLUORIDE_TOTAL_NA_UG/L"
## [91] "VOLATILE SUSPENDED SOLIDS_NA_NA_UG/L"
## [92] "IRON_TOTAL_NA_UG/L"
## [93] "ALUMINUM_TOTAL_NA_UG/L"
## [94] "BARIUM_TOTAL_NA_UG/L"
## [95] "CYANIDE_TOTAL_NA_UG/L"
## [96] "GLYPHOSATE_TOTAL_NA_UG/L"
## [97] "ACETOCHLOR_TOTAL_NA_UG/L"
## [98] "ALACHLOR_TOTAL_NA_UG/L"
## [99] "ALDRIN_TOTAL_NA_UG/L"
## [100] ".ALPHA.-HEXACHLOROCYCLOHEXANE_TOTAL_NA_UG/L"
## [101] "CIS-CHLORDANE_TOTAL_NA_UG/L"
## [102] "CAPTAN_TOTAL_NA_UG/L"
## [103] "CYANAZINE_TOTAL_NA_UG/L"
## [104] "DIELDRIN_TOTAL_NA_UG/L"
## [105] "ENDRIN_TOTAL_NA_UG/L"
## [106] "LINDANE_TOTAL_NA_UG/L"
## [107] "TRANS-CHLORDANE_TOTAL_NA_UG/L"
## [108] "HEPTACHLOR_TOTAL_NA_UG/L"
## [109] "HEPTACHLOR EPOXIDE_TOTAL_NA_UG/L"
## [110] "HEXACHLOROBENZENE_TOTAL_NA_UG/L"
## [111] "METHOXYCHLOR_TOTAL_NA_UG/L"
## [112] "METOLACHLOR_TOTAL_NA_UG/L"
## [113] "METRIBUZIN_TOTAL_NA_UG/L"
## [114] "P,P'-DDD_TOTAL_NA_UG/L"
## [115] "P,P'-DDE_TOTAL_NA_UG/L"
## [116] "P,P'-DDT_TOTAL_NA_UG/L"
## [117] "PENDIMETHALIN_TOTAL_NA_UG/L"
## [118] "TOXAPHENE_TOTAL_NA_UG/L"
## [119] "TRIFLURALIN_TOTAL_NA_UG/L"
## [120] "ATRAZINE_TOTAL_NA_UG/L"
## [121] "BUTYLATE_TOTAL_NA_UG/L"
## [122] "CHLORPYRIFOS_TOTAL_NA_UG/L"
## [123] "DIAZINON_TOTAL_NA_UG/L"
## [124] "S-ETHYL DIPROPYLTHIOCARBAMATE_TOTAL_NA_UG/L"
## [125] "PARATHION_TOTAL_NA_UG/L"
## [126] "FONOFOS_TOTAL_NA_UG/L"
## [127] "MALATHION_TOTAL_NA_UG/L"
## [128] "METHYL PARATHION_TOTAL_NA_UG/L"
## [129] "PHORATE_TOTAL_NA_UG/L"
## [130] "SIMAZINE_TOTAL_NA_UG/L"
## [131] "TERBUFOS_TOTAL_NA_UG/L"
## [132] "POLYCHLORINATED BIPHENYLS_TOTAL_NA_UG/L"
## [133] "2,4-D_TOTAL_NA_UG/L"
## [134] "ACIFLUORFEN_TOTAL_NA_UG/L"
## [135] "DALAPON_TOTAL_NA_UG/L"
## [136] "DICAMBA_TOTAL_NA_UG/L"
## [137] "DINOSEB_TOTAL_NA_UG/L"
## [138] "PENTACHLOROPHENOL_TOTAL_NA_UG/L"
## [139] "PICLORAM_TOTAL_NA_UG/L"
## [140] "SILVEX_TOTAL_NA_UG/L"
## [141] "TOTAL DISSOLVED SOLIDS_NA_NA_UG/L"
## [142] "TOTAL DISSOLVED SOLIDS_TOTAL_NA_UG/L"
## [143] "SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM"
## [144] "OXIDATION REDUCTION POTENTIAL (ORP)_TOTAL_NA_VOLTS"
## [145] "TOTAL SUSPENDED SOLIDS_FILTERED, LAB_NA_UG/L"
## [146] "ALUMINUM_TOTAL RECOVERABLE_NA_UG/L"
## [147] "BARIUM_TOTAL RECOVERABLE_NA_UG/L"
## [148] "BERYLLIUM_TOTAL RECOVERABLE_NA_UG/L"
## [149] "CADMIUM_TOTAL RECOVERABLE_NA_UG/L"
## [150] "CALCIUM_TOTAL RECOVERABLE_NA_UG/L"
## [151] "CHROMIUM_TOTAL RECOVERABLE_NA_UG/L"
## [152] "COBALT_TOTAL RECOVERABLE_NA_UG/L"
## [153] "ESCHERICHIA COLI_NA_NA_MPN"
## [154] "COPPER_TOTAL RECOVERABLE_NA_UG/L"
## [155] "IRON_TOTAL RECOVERABLE_NA_UG/L"
## [156] "LEAD_TOTAL RECOVERABLE_NA_UG/L"
## [157] "MAGNESIUM_TOTAL RECOVERABLE_NA_UG/L"
## [158] "MANGANESE_TOTAL RECOVERABLE_NA_UG/L"
## [159] "MERCURY_DISSOLVED_NA_UG/L"
## [160] "MERCURY_TOTAL RECOVERABLE_NA_UG/L"
## [161] "MOLYBDENUM_DISSOLVED_NA_UG/L"
## [162] "MOLYBDENUM_TOTAL RECOVERABLE_NA_UG/L"
## [163] "NICKEL_TOTAL RECOVERABLE_NA_UG/L"
## [164] "POTASSIUM_TOTAL RECOVERABLE_NA_UG/L"
## [165] "SILVER_TOTAL RECOVERABLE_NA_UG/L"
## [166] "SODIUM_TOTAL RECOVERABLE_NA_UG/L"
## [167] "SULFATE_TOTAL_AS SO4_UG/L"
## [168] "THALLIUM_DISSOLVED_NA_UG/L"
## [169] "THALLIUM_TOTAL RECOVERABLE_NA_UG/L"
## [170] "TITANIUM_DISSOLVED_NA_UG/L"
## [171] "TITANIUM_TOTAL RECOVERABLE_NA_UG/L"
## [172] "ORGANIC CARBON_DISSOLVED_NA_UG/L"
## [173] "VANADIUM_TOTAL RECOVERABLE_NA_UG/L"
## [174] "ZINC_TOTAL RECOVERABLE_NA_UG/L"
## [175] "HEIGHT, GAGE_NA_NA_M"
## [176] "ESCHERICHIA COLI_NA_NA_CFU"
## [177] "BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS_UNFILTERED_NA_UG/L"
## [178] "CHLORIDE_UNFILTERED_NA_UG/L"
## [179] "CHLOROPHYLL A_NON-FILTERABLE (PARTICLE)_NA_UG/L"
## [180] "FECAL COLIFORM_UNFILTERED_NA_CFU/100ML"
## [181] "INORGANIC CARBON_UNFILTERED_AS C_UG/L"
## [182] "NITRATE_UNFILTERED_AS N_MG/L"
## [183] "NITRITE_UNFILTERED_AS N_MG/L"
## [184] "ORGANIC CARBON_UNFILTERED_AS C_UG/L"
## [185] "ORGANIC CARBON_FILTERED, LAB_AS C_UG/L"
## [186] "TOTAL PHOSPHORUS, MIXED FORMS_FILTERED_AS P_UG/L"
## [187] "VOLATILE SUSPENDED SOLIDS_VOLATILE_NA_UG/L"
## [188] "TOTAL SOLIDS_UNFILTERED_NA_UG/L"
## [189] "ARSENIC_TOTAL RECOVERABLE_NA_UG/L"
## [190] "SELENIUM_TOTAL RECOVERABLE_NA_UG/L"
## [191] "TOTAL HARDNESS_TOTAL RECOVERABLE_NA_MG/L"
## [192] "MAGNESIUM_TOTAL_AS MG_UG/L"
## [193] "CALCIUM_TOTAL_AS CA_UG/L"
## [194] "SULFATE_UNFILTERED_AS SO4_UG/L"
## [195] "CHEMICAL OXYGEN DEMAND_NA_NA_UG/L"
## [196] "CARBONACEOUS BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS_NA_NA_UG/L"
## [197] "STRONTIUM_TOTAL RECOVERABLE_NA_UG/L"
## [198] "TOTAL COLIFORM_TOTAL_NA_CFU/100ML"
## [199] "TEMPERATURE, AIR_NA_NA_DEG C"
## [200] "STREAM STAGE_TOTAL_NA_M"
## [201] "VOLATILE SUSPENDED SOLIDS_TOTAL_NA_UG/L"
## [202] "SALINITY_TOTAL_NA_UG/KG"
## [203] "HEIGHT, GAGE_TOTAL_NA_M"
## [204] "DISTANCE FROM/TO_TOTAL_NA_IN"
## [205] "HARDNESS, CARBONATE_TOTAL_AS CACO3_UG/L"
## [206] "TEMPERATURE, AIR_TOTAL_NA_DEG C"
## [207] "APPARENT COLOR_TOTAL_NA_PCU"
## [208] "TOTAL VOLATILE SOLIDS_TOTAL_NA_UG/L"
## [209] "CHEMICAL OXYGEN DEMAND_TOTAL_NA_UG/L"
## [210] "BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS_TOTAL_NA_UG/L"
## [211] "CARBONACEOUS BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS_TOTAL_NA_UG/L"
## [212] "TOTAL PHOSPHORUS, MIXED FORMS_TOTAL_AS P_UG/L"
## [213] "BORON_TOTAL_AS B_UG/L"
## [214] "ARSENIC_TOTAL_AS AS_UG/L"
## [215] "CHROMIUM_TOTAL_AS CR_UG/L"
## [216] "NICKEL_TOTAL_AS NI_UG/L"
## [217] "SELENIUM_TOTAL_AS SE_UG/L"
## [218] "TOTAL SOLIDS_TOTAL_NA_UG/L"
## [219] "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L"
## [220] "IRON_TOTAL_AS FE_UG/L"
## [221] "BENZO[A]PYRENE_TOTAL_NA_UG/L"
## [222] "DI(2-ETHYLHEXYL) ADIPATE_TOTAL_NA_UG/L"
## [223] "DI(2-ETHYLHEXYL) PHTHALATE_TOTAL_NA_UG/L"
## [224] "2-CHORO-6-ETHYLAMINO-4-AMINO-S-TRIAZINE_TOTAL_NA_UG/L"
## [225] "2-CHLORO-4-ISOPROPYLAMINO-6-AMINO-S-TRIAZINE_TOTAL_NA_UG/L"
## [226] "OXYCHLORDANE_TOTAL_NA_UG/L"
## [227] "CLOMAZONE_TOTAL_NA_UG/L"
## [228] "CIS-NONACHLOR_TOTAL_NA_UG/L"
## [229] "PROPACHLOR_TOTAL_NA_UG/L"
## [230] "HEXACHLOROCYCLOPENTADIENE_TOTAL_NA_UG/L"
## [231] "TRANS-NONACHLOR_TOTAL_NA_UG/L"
## [232] "FLUORIDE_TOTAL_AS F_UG/L"
## [233] "METRIBUZIN DA_TOTAL_NA_UG/L"
## [234] "DEETHYLCYANAZINE AMIDE_TOTAL_NA_UG/L"
## [235] "CYANAZINE AMIDE_TOTAL_NA_UG/L"
## [236] "CYANAZINE ACID_TOTAL_NA_UG/L"
## [237] "DEETHYLCYANAZINE ACID_TOTAL_NA_UG/L"
## [238] "HEXAZINONE_TOTAL_NA_UG/L"
## [239] "METRIBUZIN DADK_TOTAL_NA_UG/L"
## [240] "TRICLOPYR_TOTAL_NA_UG/L"
## [241] "METRIBUZIN DK_TOTAL_NA_UG/L"
## [242] "METALAXYL_TOTAL_NA_UG/L"
## [243] "PROPICONAZOLE_TOTAL_NA_UG/L"
## [244] "PROPACHLOR OA_TOTAL_NA_UG/L"
## [245] "FOMESAFEN_TOTAL_NA_UG/L"
## [246] "SULFOMETURON METHYL_TOTAL_NA_UG/L"
## [247] "METSULFURON-METHYL_TOTAL_NA_UG/L"
## [248] "FLUTRIAFOL_TOTAL_NA_UG/L"
## [249] "THIFENSULFURON-METHYL_TOTAL_NA_UG/L"
## [250] "IMAZAPYR_TOTAL_NA_UG/L"
## [251] "IMAZAQUIN_TOTAL_NA_UG/L"
## [252] "IMAZETHAPYR_TOTAL_NA_UG/L"
## [253] "IMAZAMETHABENZ-METHYL_TOTAL_NA_UG/L"
## [254] "TRIASULFURON_TOTAL_NA_UG/L"
## [255] "BENSULFURON-METHYL_TOTAL_NA_UG/L"
## [256] "MYCLOBUTANIL_TOTAL_NA_UG/L"
## [257] "CHLORIMURON-ETHYL_TOTAL_NA_UG/L"
## [258] "FLUMETSULAM_TOTAL_NA_UG/L"
## [259] "TOLFENPYRAD_TOTAL_NA_UG/L"
## [260] "IMAZAMETHABENZ ACID_TOTAL_NA_UG/L"
## [261] "HALOSULFURON-METHYL_TOTAL_NA_UG/L"
## [262] "IMAZAPIC_TOTAL_NA_UG/L"
## [263] "MESOTRIONE_TOTAL_NA_UG/L"
## [264] "TEBUCONAZOLE_TOTAL_NA_UG/L"
## [265] "CLETHODIM SULFOXIDE_TOTAL_NA_UG/L"
## [266] "CLETHODIM SULFONE_TOTAL_NA_UG/L"
## [267] "THIACLOPRID_TOTAL_NA_UG/L"
## [268] "DICROTOPHOS_TOTAL_NA_UG/L"
## [269] "OXYDEMETON-METHYL_TOTAL_NA_UG/L"
## [270] "BROMACIL_TOTAL_NA_UG/L"
## [271] "DIURON_TOTAL_NA_UG/L"
## [272] "LINURON_TOTAL_NA_UG/L"
## [273] "BENSULIDE_TOTAL_NA_UG/L"
## [274] "METHYL PARAOXON_TOTAL_NA_UG/L"
## [275] "DICHLOBENIL_TOTAL_NA_UG/L"
## [276] "PROMETON_TOTAL_NA_UG/L"
## [277] "BENFLURALIN_TOTAL_NA_UG/L"
## [278] "CHLOROTHALONIL_TOTAL_NA_UG/L"
## [279] "TRIALLATE_TOTAL_NA_UG/L"
## [280] "DICHLORVOS_TOTAL_NA_UG/L"
## [281] "DIMETHOATE_TOTAL_NA_UG/L"
## [282] "CARBOFURAN_TOTAL_NA_UG/L"
## [283] "ALDICARB SULFOXIDE_TOTAL_NA_UG/L"
## [284] "ALDICARB SULFONE_TOTAL_NA_UG/L"
## [285] "BROMOXYNIL_TOTAL_NA_UG/L"
## [286] "CLOPYRALID_TOTAL_NA_UG/L"
## [287] "SIDURON_TOTAL_NA_UG/L"
## [288] "2-HYDROXYATRAZINE_TOTAL_NA_UG/L"
## [289] "DISULFOTON SULFONE_TOTAL_NA_UG/L"
## [290] "2-CHLORO-4,6-DIAMINO-S-TRIAZINE_TOTAL_NA_UG/L"
## [291] "CHLORPYRIFOS O.A._TOTAL_NA_UG/L"
## [292] "PROMETRYN_TOTAL_NA_UG/L"
## [293] "OXADIAZON_TOTAL_NA_UG/L"
## [294] "ETHOFUMESATE_TOTAL_NA_UG/L"
## [295] "CARBARYL_TOTAL_NA_UG/L"
## [296] "MECOPROP_TOTAL_NA_UG/L"
## [297] "2,4,5-T_TOTAL_NA_UG/L"
## [298] "MCPA_TOTAL_NA_UG/L"
## [299] "MCPB_TOTAL_NA_UG/L"
## [300] "2,4-DB_TOTAL_NA_UG/L"
## [301] "PROPAZINE_TOTAL_NA_UG/L"
## [302] "DISULFOTON_TOTAL_NA_UG/L"
## [303] "DIAZOXON_TOTAL_NA_UG/L"
## [304] "DICHLORPROP_TOTAL_NA_UG/L"
## [305] "CYANTRANILIPROLE_TOTAL_NA_UG/L"
## [306] "SEDAXANE_TOTAL_NA_UG/L"
## [307] "FLUXAPYROXAD_TOTAL_NA_UG/L"
## [308] "AFIDOPYROPEN_TOTAL_NA_UG/L"
## [309] "HALAUXIFEN-METHYL_TOTAL_NA_UG/L"
## [310] "HALAUXIFEN ACID_TOTAL_NA_UG/L"
## [311] "PROPACHLOR ESA_TOTAL_NA_UG/L"
## [312] "FLUPYRADIFURONE_TOTAL_NA_UG/L"
## [313] "FLUTIANIL_TOTAL_NA_UG/L"
## [314] "OXATHIAPIPROLIN_TOTAL_NA_UG/L"
## [315] "BENZOVINDIFLUPYR_TOTAL_NA_UG/L"
## [316] "TOLPYRALATE_TOTAL_NA_UG/L"
## [317] "PYDIFLUMETOFEN_TOTAL_NA_UG/L"
## [318] "FLUTIANIL OC 56574_TOTAL_NA_UG/L"
## [319] "FLUTIANIL OC 56635_TOTAL_NA_UG/L"
## [320] "BETA CYPERMETHRIN_TOTAL_NA_UG/L"
## [321] "ETHALFLURALIN_TOTAL_NA_UG/L"
## [322] "ESFENVALERATE_TOTAL_NA_UG/L"
## [323] "CYFLUTHRIN_TOTAL_NA_UG/L"
## [324] "BIFENTHRIN_TOTAL_NA_UG/L"
## [325] "DIMETHENAMID_TOTAL_NA_UG/L"
## [326] ".LAMBDA.-CYHALOTHRIN_TOTAL_NA_UG/L"
## [327] "PHOSTEBUPIRIM_TOTAL_NA_UG/L"
## [328] "CARBENDAZIM_TOTAL_NA_UG/L"
## [329] "DEETHYLCYANAZINE_TOTAL_NA_UG/L"
## [330] "DESMETHYLNORFLURAZON_TOTAL_NA_UG/L"
## [331] "BENTAZON_TOTAL_NA_UG/L"
## [332] "NORFLURAZON_TOTAL_NA_UG/L"
## [333] "THIOBENCARB_TOTAL_NA_UG/L"
## [334] "METOLACHLOR ESA_TOTAL_NA_UG/L"
## [335] "ALACHLOR OA_TOTAL_NA_UG/L"
## [336] "MANDESTROBIN_TOTAL_NA_UG/L"
## [337] "PYRACLOSTROBIN_TOTAL_NA_UG/L"
## [338] "ACETOCHLOR ESA_TOTAL_NA_UG/L"
## [339] "BOSCALID_TOTAL_NA_UG/L"
## [340] "ACETOCHLOR OA_TOTAL_NA_UG/L"
## [341] "FLUFENACET OA_TOTAL_NA_UG/L"
## [342] "DIMETHENAMID ESA_TOTAL_NA_UG/L"
## [343] "CLOTHIANIDIN_TOTAL_NA_UG/L"
## [344] "TEMBOTRIONE_TOTAL_NA_UG/L"
## [345] "BICYCLOPYRONE_TOTAL_NA_UG/L"
## [346] "SAFLUFENACIL_TOTAL_NA_UG/L"
## [347] "BICYCLOPYRONE SYN503870_TOTAL_NA_UG/L"
## [348] "DIMETHENAMID OA_TOTAL_NA_UG/L"
## [349] "PYROXASULFONE_TOTAL_NA_UG/L"
## [350] "CHLORANTRANILIPROLE_TOTAL_NA_UG/L"
## [351] "MOMFLUOROTHRIN_TOTAL_NA_UG/L"
## [352] "NICOSULFURON_TOTAL_NA_UG/L"
## [353] "TETRACONAZOLE_TOTAL_NA_UG/L"
## [354] "IMAZAMOX_TOTAL_NA_UG/L"
## [355] "IMIDACLOPRID-OLEFIN_TOTAL_NA_UG/L"
## [356] "PICOXYSTROBIN_TOTAL_NA_UG/L"
## [357] "DIFENOCONAZOLE_TOTAL_NA_UG/L"
## [358] "IMIDACLOPRID-UREA_TOTAL_NA_UG/L"
## [359] "SULFENTRAZONE_TOTAL_NA_UG/L"
## [360] "AZOXYSTROBIN_TOTAL_NA_UG/L"
## [361] "ACETAMIPRID_TOTAL_NA_UG/L"
## [362] "IMIDACLOPRID_TOTAL_NA_UG/L"
## [363] "ISOXAFLUTOLE_TOTAL_NA_UG/L"
## [364] "ALACHLOR ESA_TOTAL_NA_UG/L"
## [365] "BENZENEPROPANENITRILE, .ALPHA.-(CYCLOPROPYLCARBONYL)- 2-(METHYLSULFONYL)-.BETA.- OXO-4-(TRIFLUOROMETHYL)-_TOTAL_NA_UG/L"
## [366] "AMINOPYRALID_TOTAL_NA_UG/L"
## [367] "METOLACHLOR OA_TOTAL_NA_UG/L"
## [368] "THIAMETHOXAM_TOTAL_NA_UG/L"
## [369] "DINOTEFURAN_TOTAL_NA_UG/L"
## [370] "VELOCITY-DISCHARGE_NA_NA_CFS"
Select two characteristics to plot using the id_cols function input.
# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(R5ProfileClean8, id_cols = "TADA.ComparableDataIdentifier", groups = c("TEMPERATURE_NA_NA_DEG C", "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"))
Now we will summarize results for a single comparable data group using the TADA.ComparableDataIdentifier (i.e., comparable characteristic, unit, speciation, and fraction combination) using TADA_Histogram and TADA_Boxplot. Note that users may generate a list output of multiple plots if their input dataset has more than one unique comparable data group.
First, filter the data frame to include only “DISSOLVED OXYGEN (DO)_NA_NA_MG/L”
# filter dataframe to only "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"
R5ProfileCleanDO <- dplyr::filter(R5ProfileClean8, TADA.ComparableDataIdentifier == "DISSOLVED OXYGEN (DO)_NA_NA_MG/L")
Generate histogram.
# generate a histogram
TADA_Histogram(R5ProfileCleanDO, id_cols = "TADA.ComparableDataIdentifier")
# generate stats table
R5ProfileCleanDO_stats <- TADA_Stats(R5ProfileCleanDO)
## [1] "Note: Your dataset contains TADA-generated total nutrient results, which have fewer columns populated with metadata. This might affect how groups are displayed in the stats table."
Generate interactive box plot.
TADA_Boxplot(R5ProfileCleanDO, id_cols = "TADA.ComparableDataIdentifier")
Generate a new data frame with only the daily max dissolved oxygen values.
R5ProfileCleanDO_dailymax <- TADA_AggregateMeasurements(R5ProfileCleanDO,
agg_fun = c("max"),
clean = TRUE
)
## [1] "Aggregation results:"
##
## No aggregation needed Selected as max aggregate value
## 155 11
Generate interactive scatterplot of max daily dissolved oxygen over time.
TADA_Scatterplot(R5ProfileCleanDO_dailymax, id_cols = "TADA.ComparableDataIdentifier")
Question 15: Which figure generating functions can be used to visualize more than one characteristic? Which can be used to visualize only one characteristic?
Retain TADA Required Columns
Now we can review the “TADA” prefixed columns we have added to the data set. If we are satisfied with the conversions, filtering, flagging, etc. and the resulting “TADA” columns, we can use the TADA_RetainRequired function to remove any columns that are not required or used as filters in the TADA workflow. This reduces the size of the data frame.
R5ProfileClean9 <- TADA_RetainRequired(R5ProfileClean8)
## [1] "TADA_RetainRequired: removing columns not required for TADA workflow including original columns that have been replaced with TADA prefix duplicates."
## [1] "The following non-required columns were removed: ActivityEndDate, ActivityEndTime.Time, ActivityEndTime.TimeZoneCode, ActivityEndDateTime, ActivityConductingOrganizationText, ActivityLocation.LatitudeMeasure, ActivityLocation.LongitudeMeasure, ResultWeightBasisText, ResultTemperatureBasisText, AnalysisStartDate, ResultDetectionQuantitationLimitUrl, ActivityStartTime.TimeZoneCode_offset, ActivityEndTime.TimeZoneCode_offset, SourceMapScaleNumeric, HorizontalCollectionMethodName, VerticalMeasure.MeasureValue, VerticalMeasure.MeasureUnitCode, VerticalCollectionMethodName, VerticalCoordinateReferenceSystemDatumName, ProviderName and LastUpdated."
Question 16: How many columns were removed during TADA_RetainRequired?
TADA Shiny Application
Finally, take a look at an alternative workflow, TADA Shiny Module 1: Data Discovery and Cleaning. This is a Shiny application that runs many of the TADA functions covered in this document behind a graphical user interface. The shiny application queries the WQP, contains maps and data visualizations, flags suspect data results, handles censored data, and more. You can launch it using the code below.
DRAFT Module 1 is also currently hosted on the web with minimal server memory/storage allocated.
# download TADA Shiny repository
remotes::install_github("USEPA/TADAShiny",
ref = "develop",
dependencies = TRUE
)
# launch the app locally.
TADAShiny::run_app()