TADA Module 1: Water Quality Portal Data Discovery and Cleaning
TADA Team
2024-12-27
Source:vignettes/TADAModule1.Rmd
TADAModule1.Rmd
Overview
This vignette walks through how to use the TADA R Package to discover and clean (i.e., wrangle, Quality Assure and Quality Control (QAQC), and harmonize) Water Quality Portal (WQP) data from multiple organizations.
Install and load packages
First, install and load the remotes package specifying the repo. This is needed before installing TADA because it is only available on GitHub (not CRAN).
install.packages("remotes",
repos = "http://cran.us.r-project.org"
)
library(remotes)
Next, install and load TADA using the remotes package. USGS’s dataRetrieval and other TADA R Package dependencies will also be downloaded automatically from CRAN with the TADA install. If desired, the development version of dataRetrieval can be downloaded directly from GitHub (un-comment).
remotes::install_github("USEPA/EPATADA",
ref = "develop",
dependencies = TRUE
)
# remotes::install_github("USGS-R/dataRetrieval", dependencies=TRUE)
Finally, use the library() function to load the TADA R Package into your R session.
TADA_DataRetrieval
WQP data is retrieved and processed for compatibility with TADA. This function, TADA_DataRetrieval, builds on USGS’s dataRetrieval R package functions. It joins three WQP profiles: Site, Sample Results (physical/chemical metadata), and Project. In addition, it changes all data in the Characteristic, Speciation, Fraction, and Unit fields to uppercase and addresses result values that include special characters.
This function uses the same inputs as the dataRetrieval
readWQPdata
function. readWQPdata
does not
restrict the characteristics pulled from Water Quality Portal
(WQP).
Data retrieval filters include:
startDate
endDate
characteristicName
sampleMedia
siteType
statecode (review list of possible state and territory abbreviations)
countycode
siteid
organization
project
huc
characteristicType
The default TADA_DataRetrieval function automatically runs the TADA_AutoClean function. In this example, we will set TADA_AutoClean = FALSE and run it as a separate step in the work flow.
Tips:
-
All the query filters for the WQP work as an AND but within the fields there are ORs. For example:
Characteristics: If you choose pH & DO - it’s an OR. This means you will retrieve both pH OR DO data if available.
States: Similarly, if you choose VA and IL, it’s an OR. This means you will retrieve both VA OR IL data if available.
Combinations of fields are ANDs, such as State/VA AND Characteristic/DO”. This means you will receive all DO data available in VA.
“Characteristic” and “Characteristic Type” also work as an AND. This means that the Characteristic must fall within the CharacteristicGroup if both filters are being used, if not you will get an error.
The “siteid” is a general term WQP uses to describe both Site IDs from USGS databases and Monitoring Location Identifiers (from WQX). Each monitoring location in the Water Quality Portal (WQP) has a unique Monitoring Location Identifier, regardless of the database from which it derives. The Monitoring Location Identifier from the WQP is the concatenated Organization Identifier plus the Site ID number. Site IDs that only include a number are only unique identifiers for monitoring locations within USGS NWIS or EPA’s WQX databases separately.
Additional resources:
Review function documentation by entering the following code into the console: ?TADA_DataRetrieval
Use the code below to download data from the WQP using TADA_DataRetrieval. Edit the code chuck below to define your own WQP query inputs.
Downloads using TADA_DataRetrieval will have the same columns each time, but be aware that data are uploaded to the Water Quality Portal by individual organizations, which may or may not follow the same conventions. Data and metadata quality are not guaranteed! Carefully explore data to make sure it meets your quality assurance requirements.
Note: TADA_DataRetrieval (by leveraging dataRetrieval), automatically converts the date times to UTC. It also automatically converts field formats to dates, datetimes, and numerics based on a standard algorithm.
Enter ?TADA_DataRetrieval into the console to review example queries and additional information.
This example includes monitoring data collected from Jan 2018 to Jan 2019 by six organizations: 1) Red Lake Band of Chippewa Indians, 2) Sac & Fox Nation, 3) Pueblo of Pojoaque, 4) Minnesota Chippewa Tribe (Fond du Lac Band), 5) Pueblo of Tesuque, and 6) The Chickasaw Nation
We will move forward with this example in the remainder of the vignette.
TADAProfile <- TADA_DataRetrieval(organization = c("REDLAKE_WQX", "SFNOES_WQX", "PUEBLO_POJOAQUE", "FONDULAC_WQX", "PUEBLOOFTESUQUE", "CNENVSER"), startDate = "2018-01-01", endDate = "2019-01-01", applyautoclean = FALSE)
## [1] "Downloading WQP query results. This may take some time depending upon the query size."
## $startDate
## [1] "2018-01-01"
##
## $organization
## [1] "REDLAKE_WQX" "SFNOES_WQX" "PUEBLO_POJOAQUE" "FONDULAC_WQX"
## [5] "PUEBLOOFTESUQUE" "CNENVSER"
##
## $endDate
## [1] "2019-01-01"
## Error in get(paste0(generic, ".", class), envir = get_method_env()) :
## object 'type_sum.accel' not found
USGS dataRetrieval
Uncomment below (optional) if you would like to review differences between the profiles you would get using USGS’s readWQPdata vs. EPA’s TADA_DataRetrieval (compare dataRetrieval_example to TADAProfile). The profiles are different because TADA_DataRetrieval automatically joins in data from multiple WQP profiles, and does some additional data cleaning as part of the data retrieval process.
# dataRetrieval_example <- dataRetrieval::readWQPdata(organization = c("REDLAKE_WQX", "SFNOES_WQX", "PUEBLO_POJOAQUE", "FONDULAC_WQX", "PUEBLOOFTESUQUE", "CNENVSER"), startDate = "2018-01-01", endDate = "2019-01-01", ignore_attributes = TRUE)
TADA_BigDataRetrieval
If you need to download a large amount of data from across a large area, and the TADA_DataRetrieval function is not working due to WQP timeout issues, then the TADA_BigDataRetrieval function may work better.
This function does multiple synchronous data calls to the WQP (waterqualitydata.us). It uses the WQP summary service to limit the sites downloaded to only those with relevant data. It pulls back data from set number of stations at a time and then joins the data back together to produce a single TADA compatible dataframe as the output.
See ?TADA_BigDataRetrieval for more details. WARNING, some of the examples below can take multiple HOURS to run. The total run time depends on your query inputs.
# AK_AL_WaterTemp <- TADA_BigDataRetrieval(startDate = "2000-01-01", endDate = "2022-12-31", characteristicName = "Temperature, water", statecode = c("AK","AL"))
# AllWaterTemp <- TADA_BigDataRetrieval(characteristicName = "Temperature, water")
# AllPhosphorus <- TADA_BigDataRetrieval(characteristicName = "Phosphorus")
# AllCT <- TADA_BigDataRetrieval(statecode = "CT")
Filter data based on media type
Some TADA users are interested in using WQP data for surface water only or for analysis of some non-water data. The TADA_AnalysisDataFilter function can assist in identifying results of interest. Multiple columns are used to identify groundwater results as different organizations may populate different combinations of fields in order to identify a result as groundwater.
This function identifies surface water, groundwater, and sediment results. Users can specify whether all results should be returned with a new column, TADA.UseForAnalysis.Flag,identifying if the result should be included in further analysis or if only results that should be in included are returned.
The defaults are to include surface water, exclude groundwater and sediment, and to return only the results that should be used for analysis (clean = TRUE). This is shown in the active example below. If you would like to see all results with the TADA.UseForAnalysis.Flag column, you can uncomment the example where clean = FALSE.
If you are not interested in using TADA_AnalysisDataFilter, but would like to filter by activity media, uncomment the example to filter for water data only by using dplyr::filter() with TADA.ActivityMediaName.
# Filter to retain only results for use in analysis
TADAProfile <- TADA_AnalysisDataFilter(TADAProfile,
clean = TRUE,
surface_water = TRUE,
ground_water = FALSE,
sediment = FALSE
)
## [1] "TADA_AnalysisDataFilter: Identifying groundwater results."
## [1] "TADA_AnalysisDataFilter: Flagging surface water results to include in assessments."
## [1] "TADA_AnalysisDataFilter: Flagging groundwater results to exclude from assessments."
## [1] "TADA_AnalysisDataFilter: Flagging sediment results to exclude from assessments."
## [1] "TADA_AnalysisDataFilter: Removing results flagged for exclusion from assessments."
# Add TADA.UseForAnalysis.Flag column to identify which results should be used for analysis
# TADAProfile <- TADA_AnalysisDataFilter(TADAProfile, clean = FALSE)
# Remove data for non-water media types, alternate workflow without using TADA_AnalysisDataFilter()
# TADAProfile <- dplyr::filter(TADAProfile, TADA.ActivityMediaName == "WATER")
TADA_AutoClean
Now TADA_AutoClean can be run on a smaller dataset after unnecessary results have been removed. It performs the following functions on the data retrieved from the WQP:
TADA_ConvertSpecialChars - converts result value columns to numeric and flags non-numeric values that could not be converted.
TADA_ConvertResultUnits - unifies result units for easier quality control and review
TADA_ConvertDepthUnits - converts depth units to a consistent unit (meters).
TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.
Other helpful actions - converts important text columns to all upper-case letters, removes exact duplicates, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” The following columns are numeric versions of their WQP origins:
- TADA.ResultMeasureValue
- TADA.DetectionQuantitationLimitMeasure.MeasureValue
- TADA.LatitudeMeasure
- TADA.LongitudeMeasure
These functions also add the columns TADA.ResultMeasureValueDataTypes.Flag and TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag, which provide information about the result values that is needed to address censored data later on (i.e., nondetections). Specifically, these new columns flag if special characters are included in result values, and specifies what the special characters are.
# run TADA_AutoClean on filtered dataset to convert special characters, result units, and depth units and identify censored data.
TADAProfile <- TADA_AutoClean(TADAProfile)
## [1] "TADA_Autoclean: creating TADA-specific columns."
## [1] "TADA_Autoclean: harmonizing dissolved oxygen characterisic name to DISSOLVED OXYGEN SATURATION if unit is % or % SATURATN."
## [1] "TADA_Autoclean: handling special characters and coverting TADA.ResultMeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureValue value fields to numeric."
## [1] "TADA_Autoclean: converting TADA.LatitudeMeasure and TADA.LongitudeMeasure fields to numeric."
## [1] "TADA_Autoclean: harmonizing synonymous unit names (m and meters) to m."
## [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
## [1] "8 results in your dataset have one of the following deprecated characteristic names: Inorganic nitrogen (nitrate and nitrite) ***retired***use Nitrate + Nitrite. These names have been substituted with the updated preferred names in the TADA.CharacteristicName field."
## [1] "TADA_Autoclean: harmonizing result and depth units."
## [1] "TADA_Autoclean: creating TADA.ComparableDataIdentifier field for use when generating visualizations and analyses."
## [1] "NOTE: This version of the TADA package is designed to work with numeric data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"
Review all column names in the TADA Profile to familiarize yourself with the dataset after TADA_AutoClean has added additional TADA prefixed columns. TADA_SummarizeColumn summarizes the data set based on the user specified column and returns a dataframe displaying the number of sites and number of records for each unique value in the specified column. The example below uses TADA.CharacteristicName.
# View column names for TADAProfile
colnames(TADAProfile)
## [1] "ActivityTypeCode"
## [2] "ActivityMediaName"
## [3] "TADA.ActivityMediaName"
## [4] "ActivityMediaSubdivisionName"
## [5] "ResultSampleFractionText"
## [6] "TADA.ResultSampleFractionText"
## [7] "CharacteristicName"
## [8] "TADA.CharacteristicName"
## [9] "SubjectTaxonomicName"
## [10] "SampleTissueAnatomyName"
## [11] "MethodSpeciationName"
## [12] "TADA.MethodSpeciationName"
## [13] "TADA.ComparableDataIdentifier"
## [14] "ActivityStartDate"
## [15] "ActivityStartTime.Time"
## [16] "ActivityStartTime.TimeZoneCode"
## [17] "ActivityStartDateTime"
## [18] "ResultMeasureValue"
## [19] "ResultMeasure.MeasureUnitCode"
## [20] "TADA.ResultMeasureValue"
## [21] "TADA.ResultMeasure.MeasureUnitCode"
## [22] "TADA.WQXResultUnitConversion"
## [23] "ResultValueTypeName"
## [24] "TADA.ResultMeasureValueDataTypes.Flag"
## [25] "ResultDetectionConditionText"
## [26] "DetectionQuantitationLimitTypeName"
## [27] "DetectionQuantitationLimitMeasure.MeasureValue"
## [28] "DetectionQuantitationLimitMeasure.MeasureUnitCode"
## [29] "TADA.DetectionQuantitationLimitMeasure.MeasureValue"
## [30] "TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode"
## [31] "TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag"
## [32] "ResultDepthHeightMeasure.MeasureValue"
## [33] "TADA.ResultDepthHeightMeasure.MeasureValue"
## [34] "TADA.ResultDepthHeightMeasure.MeasureValueDataTypes.Flag"
## [35] "ResultDepthHeightMeasure.MeasureUnitCode"
## [36] "TADA.ResultDepthHeightMeasure.MeasureUnitCode"
## [37] "ResultDepthAltitudeReferencePointText"
## [38] "ActivityRelativeDepthName"
## [39] "ActivityDepthHeightMeasure.MeasureValue"
## [40] "TADA.ActivityDepthHeightMeasure.MeasureValue"
## [41] "TADA.ActivityDepthHeightMeasure.MeasureValueDataTypes.Flag"
## [42] "ActivityDepthHeightMeasure.MeasureUnitCode"
## [43] "TADA.ActivityDepthHeightMeasure.MeasureUnitCode"
## [44] "ActivityTopDepthHeightMeasure.MeasureValue"
## [45] "TADA.ActivityTopDepthHeightMeasure.MeasureValue"
## [46] "TADA.ActivityTopDepthHeightMeasure.MeasureValueDataTypes.Flag"
## [47] "ActivityTopDepthHeightMeasure.MeasureUnitCode"
## [48] "TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode"
## [49] "ActivityBottomDepthHeightMeasure.MeasureValue"
## [50] "TADA.ActivityBottomDepthHeightMeasure.MeasureValue"
## [51] "TADA.ActivityBottomDepthHeightMeasure.MeasureValueDataTypes.Flag"
## [52] "ActivityBottomDepthHeightMeasure.MeasureUnitCode"
## [53] "TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode"
## [54] "ResultTimeBasisText"
## [55] "StatisticalBaseCode"
## [56] "ResultFileUrl"
## [57] "ResultAnalyticalMethod.MethodName"
## [58] "ResultAnalyticalMethod.MethodDescriptionText"
## [59] "ResultAnalyticalMethod.MethodIdentifier"
## [60] "ResultAnalyticalMethod.MethodIdentifierContext"
## [61] "ResultAnalyticalMethod.MethodUrl"
## [62] "SampleCollectionMethod.MethodIdentifier"
## [63] "SampleCollectionMethod.MethodIdentifierContext"
## [64] "SampleCollectionMethod.MethodName"
## [65] "SampleCollectionMethod.MethodDescriptionText"
## [66] "SampleCollectionEquipmentName"
## [67] "MeasureQualifierCode"
## [68] "ResultStatusIdentifier"
## [69] "ResultCommentText"
## [70] "ActivityCommentText"
## [71] "HydrologicCondition"
## [72] "HydrologicEvent"
## [73] "DataQuality.PrecisionValue"
## [74] "DataQuality.BiasValue"
## [75] "DataQuality.ConfidenceIntervalValue"
## [76] "DataQuality.UpperConfidenceLimitValue"
## [77] "DataQuality.LowerConfidenceLimitValue"
## [78] "SamplingDesignTypeCode"
## [79] "LaboratoryName"
## [80] "ResultLaboratoryCommentText"
## [81] "ResultIdentifier"
## [82] "ActivityIdentifier"
## [83] "OrganizationIdentifier"
## [84] "OrganizationFormalName"
## [85] "ProjectName"
## [86] "ProjectDescriptionText"
## [87] "ProjectIdentifier"
## [88] "ProjectFileUrl"
## [89] "QAPPApprovedIndicator"
## [90] "QAPPApprovalAgencyName"
## [91] "CountryCode"
## [92] "StateCode"
## [93] "CountyCode"
## [94] "MonitoringLocationName"
## [95] "MonitoringLocationTypeName"
## [96] "MonitoringLocationDescriptionText"
## [97] "LatitudeMeasure"
## [98] "TADA.LatitudeMeasure"
## [99] "LongitudeMeasure"
## [100] "TADA.LongitudeMeasure"
## [101] "HorizontalCoordinateReferenceSystemDatumName"
## [102] "HUCEightDigitCode"
## [103] "MonitoringLocationIdentifier"
## [104] "AquiferName"
## [105] "AquiferTypeName"
## [106] "LocalAqfrName"
## [107] "ConstructionDateText"
## [108] "WellDepthMeasure.MeasureValue"
## [109] "WellDepthMeasure.MeasureUnitCode"
## [110] "WellHoleDepthMeasure.MeasureValue"
## [111] "WellHoleDepthMeasure.MeasureUnitCode"
## [112] "ActivityDepthAltitudeReferencePointText"
## [113] "ActivityEndDate"
## [114] "ActivityEndTime.Time"
## [115] "ActivityEndTime.TimeZoneCode"
## [116] "ActivityEndDateTime"
## [117] "ActivityConductingOrganizationText"
## [118] "SampleAquifer"
## [119] "ActivityLocation.LatitudeMeasure"
## [120] "ActivityLocation.LongitudeMeasure"
## [121] "ResultWeightBasisText"
## [122] "ResultTemperatureBasisText"
## [123] "ResultParticleSizeBasisText"
## [124] "USGSPCode"
## [125] "BinaryObjectFileName"
## [126] "BinaryObjectFileTypeCode"
## [127] "AnalysisStartDate"
## [128] "ResultDetectionQuantitationLimitUrl"
## [129] "LabSamplePreparationUrl"
## [130] "ActivityStartTime.TimeZoneCode_offset"
## [131] "ActivityEndTime.TimeZoneCode_offset"
## [132] "SourceMapScaleNumeric"
## [133] "HorizontalAccuracyMeasure.MeasureValue"
## [134] "HorizontalAccuracyMeasure.MeasureUnitCode"
## [135] "HorizontalCollectionMethodName"
## [136] "VerticalMeasure.MeasureValue"
## [137] "VerticalMeasure.MeasureUnitCode"
## [138] "VerticalAccuracyMeasure.MeasureValue"
## [139] "VerticalAccuracyMeasure.MeasureUnitCode"
## [140] "VerticalCollectionMethodName"
## [141] "VerticalCoordinateReferenceSystemDatumName"
## [142] "FormationTypeText"
## [143] "ProjectMonitoringLocationWeightingUrl"
## [144] "DrainageAreaMeasure.MeasureValue"
## [145] "DrainageAreaMeasure.MeasureUnitCode"
## [146] "ContributingDrainageAreaMeasure.MeasureValue"
## [147] "ContributingDrainageAreaMeasure.MeasureUnitCode"
## [148] "ProviderName"
## [149] "LastUpdated"
# Review the number of sites and number of records for each CharacteristicName in TADAProfile
TADAProfile_CharSummary <- TADA_SummarizeColumn(TADAProfile, "TADA.CharacteristicName")
# View TADAProfile_CharSummary
TADAProfile_CharSummary
## # A tibble: 104 × 3
## TADA.CharacteristicName n_sites n_records
## <chr> <int> <int>
## 1 .ALPHA.-ENDOSULFAN 6 7
## 2 .BETA.-ENDOSULFAN 6 7
## 3 .BETA.-HEXACHLOROCYCLOHEXANE 6 7
## 4 .DELTA.-HEXACHLOROCYCLOHEXANE 6 7
## 5 ALDRIN 6 7
## 6 ALKALINITY, TOTAL 126 688
## 7 ALPHA PARTICLE 6 14
## 8 ALUMINUM 6 7
## 9 AMMONIA-NITROGEN 83 328
## 10 AMMONIUM 16 93
## # ℹ 94 more rows
Invalid coordinates
Review station locations and summary information using the TADA_OverviewMap function. TADA_OverviewMap counts the number of unique results, characteristics, and organizations at each monitoring location in the dataset and creates a tidy map for reviewing summary stats spatially. Larger point sizes indicate more results collected at a given site, while darker blue colors indicate more unique characteristics collected at the site. Users may click on a site to view a pop-up with this summary information, including the number of organizations that reported results at that site. This map may inform a user’s decision to remove/correct sites that are outside the US.
TADA_OverviewMap(TADAProfile)
The TADA TADA_FlagCoordinates function identifies and flags potentially invalid coordinate data. While its functionality is showcased here, it is always important to review any invalid outputs before cleaning to reduce the risk of leaving out usable data/sites.
Allowable values for clean_outsideUSA are “no”, “remove”, or “change sign”. The default is “no” which flags latitude and longitude coordinates outside the USA. Assigning clean_ousideUSA = “remove” will remove rows of data with coordinates outside the USA. And assigning clean_outsideUSA = “change sign” will flip the sign of latitude or longitude coordinates flagged as outside the USA. The “change sign” option should only be used when it is known that coordinates were entered with the wrong sign in WQX; additionally, the data owner should fix these incorrect coordinates in the raw data through the WQX - for assistance email the WQX help desk: WQX@epa.gov
Allowable values for clean_imprecise are TRUE or FALSE. The default is FALSE which flags rows of data with invalid or imprecise coordinates without removing them. Assigning clean_imprecise = TRUE will remove rows of data with invalid or imprecise coordinates.
Allowable values for flaggedonly are TRUE or FALSE. The default is FALSE which keeps all rows of data regardless of flag status. Assigning flaggedonly = TRUE filters the dataframe to show only rows of data which are flagged.
When clean_outsideUSA = “no” and/or clean_imprecise = FALSE, a column will be appended titled “TADA.InvalidCoordinates.Flag” with the following flags (if relevant to dataframe):
If the latitude is less than zero, the row will be flagged with “LAT_OutsideUSA”. (Exception for American Samoa)
If the longitude is greater than zero AND less than 145, the row will be flagged as “LONG_OutsideUSA”. (Exceptions for Guam and the Northern Mariana Islands)
If the latitude or longitude contains the string, “999”, the row will be flagged as invalid.
Finally, precision can be measured by the number of decimal places in the latitude and longitude provided. If either does not have any numbers to the right of the decimal point, the row will be flagged as “Imprecise”.
# flag only
TADAProfileClean1 <- TADA_FlagCoordinates(TADAProfile, clean_outsideUSA = "no", clean_imprecise = FALSE, flaggedonly = FALSE)
# review unique flags in TADAProfileClean1
unique(TADAProfileClean1$TADA.SuspectCoordinates.Flag)
# review unique MonitoringLocationIdentifiers in your flag dataframe
unique(TADAProfileClean1$MonitoringLocationIdentifier)
Unique_SuspectCoordinateFlags <- TADAProfileClean1 %>%
dplyr::select(
"MonitoringLocationIdentifier",
"MonitoringLocationName",
"TADA.SuspectCoordinates.Flag",
"OrganizationIdentifier",
"TADA.LongitudeMeasure",
"TADA.LatitudeMeasure",
"MonitoringLocationTypeName",
"CountryCode",
"StateCode",
"CountyCode",
"HUCEightDigitCode",
"MonitoringLocationDescriptionText",
"ProjectName",
"ProjectIdentifier",
"OrganizationFormalName"
) %>%
dplyr::distinct()
Unique_SuspectCoordinateFlags
# if needed, un-comment below to change the sign for all data for sites flagged as outside the USA. You can also change FALSE to TRUE if you want to remove sites outside of the US or sites with imprecise lat/longs
# TADAProfileClean1 <- TADA_FlagCoordinates(TADAProfile, clean_outsideUSA = "change sign", clean_imprecise = FALSE, flaggedonly = FALSE)
Depth unit conversions
The TADA_ConvertDepthUnits function converts depth units to a consistent unit. Depth values and units are most commonly associated with lake data, and are populated in the ActivityDepthHeightMeasure, ActivityTopDepthHeightMeasure, ActivityBottomDepthHeightMeasure, and ResultDepthHeightMeasure Result Value/Unit columns.
Allowable values for ‘unit’ are either ‘m’ (meter), ‘ft’ (feet), or ‘in’ (inch). ‘unit’ accepts only one allowable value as an input. Default is unit = “m”.
Note that upon download using TADA_DataRetrieval, all depth columns are converted to meters by default. However, the user may choose to run the TADA_ConvertDepthUnits function on their dataset to convert to another unit. See function documentation for additional input options by entering the following code in the console: ?TADA_ConvertDepthUnits
# converts all depth profile data to meters
TADAProfileClean1 <- TADA_ConvertDepthUnits(TADAProfileClean1,
unit = "ft",
transform = TRUE
)
Continuous (time series) data
Continuous data may (or may not) be suitable for integration with discrete water quality data for analyses. Therefore, the TADA_FlagContinuousData function was developed to flag rows with continuous data.
See function documentation for additional details by entering the following code in the console: ?TADA_FlagContinuousData
TADAProfileClean1 <- TADA_FlagContinuousData(TADAProfileClean1,
clean = FALSE,
flaggedonly = FALSE,
time_difference = 4
)
# uncomment below to create a dataframe of only the continuous data
# TADAProfile_onlycont <- TADA_FlagContinuousData(TADAProfileClean1, clean = FALSE, flaggedonly = TRUE, time_difference = 4)
WQX Quality Assurance and Quality Control (QAQC) Service Result Flags
Run the following result functions to address suspect method, fraction, speciation, and unit metadata by characteristic. The default is clean = TRUE, which will remove suspect results. You can change this to clean = FALSE to flag results, but not remove them.
See documentation for more details:
-
?TADA_FlagMethod
When clean = FALSE, this function adds the following column to your dataframe: TADA.AnalyticalMethod.Flag. This column flags invalid TADA.CharacteristicName, ResultAnalyticalMethod/MethodIdentifier, and ResultAnalyticalMethod/MethodIdentifierContext combinations in your dataframe either “NonStandardized”, “Suspect”, or “Pass”.
When clean = TRUE, “Suspect” rows are removed from the dataframe and no column will be appended.
When flaggedonly = TRUE, the dataframe is filtered to only the rows flagged as “Suspect”; default is flaggedonly = FALSE.
-
?TADA_FlagSpeciation
-
When clean = “none”, this function adds the following column to your dataframe: TADA.MethodSpeciation.Flag. This column flags each TADA.CharacteristicName and MethodSpeciationName combination in your dataframe as either “NonStandardized”,
“Suspect”, or “Pass”.
When clean = “suspect_only”, only “Suspect” rows are removed from the dataframe. Default is clean = “suspect_only”.
When clean = “nonstandardized_only”, only “NonStandardized” rows are removed from the dataframe.
When clean = “both”, “Invalid” and “NonStandardized” rows are removed from the dataframe.
When clean = “none”, no rows are removed from the dataframe.
When flaggedonly = TRUE, the dataframe is filtered to only the rows flagged as “Suspect” or “NonStandardized”; default is flaggedonly = FALSE.
-
-
?TADA_FlagResultUnit
When clean = FALSE, the following column will be added to your dataframe: TADA.ResultUnit.Flag. This column flags each TADA.CharacteristicName, TADA.ActivityMediaName, and TADA.ResultMeasure.MeasureUnitCode combination in your dataframe as either “NonStandardized”, “Invalid”, or “Valid”.
When clean = TRUE, “Suspect” rows are removed from the dataframe and no column will be appended.
When flaggedonly = TRUE, the dataframe is filtered to only the rows flagged as “Suspect”; default is flaggedonly = FALSE.
-
?TADA_FlagFraction
- When clean = FALSE, this function adds the following column to your dataframe: TADA.SampleFraction.Flag. This column flags each TADA.CharacteristicName and TADA.ResultSampleFractionText combination in your dataframe as either “NonStandardized”, “Suspect”, or “Pass”.
- When clean = TRUE, “Suspect” rows are removed from the dataframe and no column will be appended.
- When flaggedonly = TRUE, the dataframe is filtered to only the rows flagged as “Suspect”; default is flaggedonly = FALSE.
TADAProfileClean2 <- TADA_FlagMethod(TADAProfileClean1, clean = TRUE)
TADAProfileClean2 <- TADA_FlagFraction(TADAProfileClean2, clean = TRUE)
TADAProfileClean2 <- TADA_FlagSpeciation(TADAProfileClean2, clean = "suspect_only")
TADAProfileClean2 <- TADA_FlagResultUnit(TADAProfileClean2, clean = "suspect_only")
WQX national upper and lower thresholds
Run the following code to flag or remove results that are above or below the national upper and lower bound for each characteristic and unit combination. See documentation for more details:
-
?TADA_FlagAboveThreshold
When clean = FALSE, the following column is added to your dataframe: TADA.ResultValueAboveUpperThreshold.Flag. This column flags rows with data that are above the upper WQX threshold. The default is clean = FALSE.
When clean = TRUE, data that is above the upper WQX threshold is removed from the dataframe.
When flaggedonly = TRUE, the dataframe is filtered to only the rows flagged as above the upper WQX threshold; default is flaggedonly = FALSE.
-
?TADA_FlagBelowThreshold
When clean = FALSE, the following column is added to your dataframe: TADA.ResultValueBelowLowerThreshold.Flag. This column flags rows with data that are below the lower WQX threshold. The default is clean = FALSE.
When clean = TRUE, data that is below the lower WQX threshold is removed from the dataframe.
When flaggedonly = TRUE, the dataframe is filtered to only the rows flagged as below the lower WQX threshold; default is flaggedonly = FALSE.
TADAProfileClean3 <- TADA_FlagAboveThreshold(TADAProfileClean2, clean = TRUE)
TADAProfileClean3 <- TADA_FlagBelowThreshold(TADAProfileClean3, clean = TRUE)
Potential duplicates
Sometimes multiple organizations submit the exact same data to Water Quality Portal (WQP), which can affect water quality analyses and assessments. Similarly, organizations occasionally submit the same data multiple times to the Portal. The following functions check for and identify data that may be duplicates based on date, time, characteristic, result value, and a distance buffer. Each pair or group of potential duplicate rows is flagged with a unique ID. For more information, review the documentation by entering the following into the console:
- ?TADA_FindPotentialDuplicatesMultipleOrgs
- ?TADA_FindPotentialDuplicatesSingleOrg
TADAProfileClean3 <- TADA_FindPotentialDuplicatesSingleOrg(TADAProfileClean3)
## [1] "156 groups of potentially duplicated results found in dataset. These have been placed into duplicate groups in the TADA.SingleOrgDupGroupID column and the function randomly selected one result from each group to represent a single, unduplicated value. Selected values are indicated in the TADA.SingleOrgDup.Flag as 'Unique', while duplicates are flagged as 'Duplicate' for easy filtering."
TADAProfileClean3 <- TADA_FindPotentialDuplicatesMultipleOrgs(TADAProfileClean3,
dist_buffer = 100,
org_hierarchy = "none"
)
## [1] "No duplicate results detected. Returning input dataframe with duplicate flagging columns set to 'N'."
Review QAPP information
The TADA_FindQAPPApproval function checks data for an approved QAPP.
This function checks to see if there is any information in the column “QAPPApprovedIndicator”. Some organizations submit data for this field to indicate if the data produced has an approved Quality Assurance Project Plan (QAPP) or not. In this field, Y indicates yes, N indicates no.
This function has three default inputs: clean = TRUE, cleanNA = FALSE, and flaggedonly = FALSE. These defaults remove rows of data where the QAPPApprovedIndicator equals “N”.
Users could alternatively remove both N’s and NA’s using the inputs clean = TRUE, cleanNA = TRUE, and flaggedonly = FALSE.
Additionally, users could filter to show only N’s and NA’s by using the inputs clean = FALSE, cleanNA = FALSE, and flaggedonly = TRUE.
If clean = FALSE, cleanNA = FALSE, and flaggedonly = FALSE, the function will not do anything.
TADAProfileClean3 <- TADA_FindQAPPApproval(TADAProfileClean3, clean = FALSE, cleanNA = FALSE)
## [1] "Data is flagged but not removed because clean and cleanNA were FALSE"
The TADA_FindQAPPDoc function checks to see if a QAPP Doc is Available
This function checks data submitted under the “ProjectFileUrl” column to determine if a QAPP document is available to review. When clean = FALSE, a column will be appended to flag results that do have an associated QAPP document URL provided. When clean = TRUE, rows that do not have an associated QAPP document are removed from the dataframe and no column will be appended. When flaggedonly = TRUE, the dataframe is filtered to show only rows that do not have an associated QAPP document. The defaults are clean = FALSE and flaggedonly = FALSE. This function should only be used to remove data if an accompanying QAPP document is required to use data in assessments.
TADAProfileClean3 <- TADA_FindQAPPDoc(TADAProfileClean3,
clean = FALSE
)
## [1] "No QAPP document url data found in your dataframe. Returning input dataframe with TADA.QAPPDocAvailable column for tracking."
Full Dataframe Filtering
In this section a TADA user will want to review the unique values in specific fields and may choose to remove data with particular values.
To start, review the list of common fields used for filtering, and the number of unique values in each field using the TADA_FieldCounts function.
This function returns counts for you entire data frame for each of the following fields (if populated, columns that are populated only with NA’s are not included in the output):
ActivityTypeCode
TADA.ActivityMediaName
ActivityMediaSubdivisionName
ActivityCommentText
MonitoringLocationTypeName
StateCode
OrganizationFormalName
TADA.CharacteristicName
HydrologicCondition
HydrologicEvent
BiologicalIntentName
MeasureQualifierCode
ActivityGroup
AssemblageSampledName
ProjectName
CharacteristicNameUserSupplied
DetectionQuantitationLimitTypeName
SampleTissueAnatomyName
LaboratoryName
# multiple options
# print table to console
TADA_FieldCounts(TADAProfileClean3)
## Fields Count
## 1 ActivityCommentText 164
## 2 ResultCommentText 129
## 3 TADA.ComparableDataIdentifier 126
## 4 TADA.CharacteristicName 100
## 5 ResultAnalyticalMethod.MethodName 45
## 6 ProjectName 14
## 7 SubjectTaxonomicName 13
## 8 SampleCollectionMethod.MethodName 12
## 9 ProjectDescriptionText 7
## 10 LaboratoryName 7
## 11 TADA.ResultSampleFractionText 7
## 12 OrganizationFormalName 6
## 13 TADA.MethodSpeciationName 6
## 14 DetectionQuantitationLimitTypeName 5
## 15 ActivityTypeCode 4
## 16 StateCode 4
## 17 ActivityMediaSubdivisionName 2
## 18 MonitoringLocationTypeName 2
## 19 TADA.ActivityMediaName 1
# create object of table
fieldCounts_Table <- TADA_FieldCounts(TADAProfileClean3)
Next, choose a field from the list generated above to view a summary table or pie chart of the counts of unique values in that field using TADA_FieldValuesTable or TADA_FieldValuesPie. We’ll start with ActivityTypeCode.
TADA_FieldValuesTable(TADAProfileClean3, field = "ActivityTypeCode")
## Value Count
## 1 Field Msr/Obs 20090
## 2 Sample-Routine 4448
## 3 Quality Control Sample-Field Replicate 208
## 4 Quality Control Sample-Equipment Blank 36
TADA_FieldValuesPie(TADAProfileClean3, field = "ActivityTypeCode")
The ActivityTypeCode field has multiple unique values. Before we remove the QC samples/measurements from this dataset to prepare for analyses, lets review flagged Quality Control (QC) values using the TADA_FindQCActivities function, which adds a new TADA TADA.ActivityType.Flag column.
For example, the new QC_replicate flag in TADA.ActivityType.Flag column indicates that the flagged rows include any of the following replicate values: - Quality Control Field Replicate Habitat Assessment - Quality Control Field Replicate Msr/Obs - Quality Control Field Replicate Portable Data Logger - Quality Control Field Replicate Sample-Composite - Quality Control Sample-Field Replicate
See WQX domain file to review all the ActivityTypeCode allowable values: https://cdx.epa.gov/wqx/download/DomainValues/ActivityType.CSV
# Review flagged QC samples using the TADA_FindQCActivities function:
# enter ?TADA_FindQCActivities into the console for more information
TADAProfileClean3a <- TADA_FindQCActivities(TADAProfileClean3,
clean = FALSE,
flaggedonly = TRUE
)
# Filter to review only data where the TADA.ActivityType.Flag = "QC_replicate"
TADAProfileClean3a <- dplyr::filter(TADAProfileClean3a, TADA.ActivityType.Flag == "QC_replicate")
Now, let’s run TADA_PairReplicates to see if any replicates in this dataframe can be paired with their original (parent) samples/measurements.
We found over 100 replicates in this dataframe that have a paired parent sample/measurement (based on a 10-minute time window, which can be adjusted if desired). Enter ?TADA_PairReplicates into the console for more details.
What are replicate samples and how are they used in water analyses?
Replicate field samples are samples taken to assess the reproducibility of the sampling technique or analytical method. They are independently carried through all the steps of the sampling and measurement process in an identical manner to their associated routine field sample and used to measure the precision of the total sampling method.
Theoretically, the analysis of a replicate field sample should yield a very similar result as its associated routine field sample. If the results are not the same or acceptably similar, it could signal possible contamination or other issues in the sampling chain. However, water quality can vary at very small scales. So, the field replicate can mix up analytical precision with small scale variability. Field replicates tell you the potential for your method to yield the same results at a single time and place, to the extent that you are actually in exactly the same place, and the few seconds (or any defined time window) from one sample to the next does not matter, and the water isn’t moving. Be careful about labeling data as imprecise or bad based on this alone.
Users of TADA have noted that it would be useful to incorporate replicate field samples into water quality data analysis by (a) flagging routine field sample measurements whose associated replicate field sample measurements are outside of a user-defined window of precision (relative percent difference or absolute difference) and/or (b) averaging or randomly replacing routine field sample measurements with their associated replicate field sample measurements.
For now, users can perform these subsequent analyses outside of TADA. A two-stage data-quality-indicator, where low values should be within the absolute difference limit and high values within the Relative Percent Difference (RPD) limit, may be appropriate. RPD is the calculated difference (RPD) between the routine sample result and its associated replicate sample result. For example, if the RPD/CV exceeds 20% some water quality, analysts consider that to be a potentially concerning lack of precision, especially for non-particulate analytes. However, depending on the characteristic being analyzed and the sampling method, acceptable RPDs can vary widely. Therefore, it is best for the user to define their own level of RPD acceptability. In addition, a tiered approach may be more appropriate, where the widely used 20% RPD for measurements can be used for results above XX-times the detection limit, but also an absolute difference approach can be used for those result-values near the detection limit, or lower than the detection limit (e.g., phosphorus). An absolute difference approach is more appropriate when implementing RPD for samples close to the detection limit, as even small absolute differences might show up as large relative percent differences that “fail” the 20% RPD test.
For example, when nutrient concentrations are close to detection limit, it becomes impossible to have a low RPD. In this scenario, high RPD’s are acceptable because if you stand back and look at ALL the data, and not just the replicates, these data may be agreeing perfectly well that nutrients are very low. DO NOT throw out data if RPD is >20%, unless you have good reason, or you will potentially bias your data toward high concentrations. QA procedures should not bias statistical analyses of the data. Note that a modest error in a measurement will have a much smaller effect than implementing a QA process that builds in bias.
# Run TADA_PairReplicates to add new TADA.ReplicateSampleID column
TADAProfileClean3b <- TADA_PairReplicates(TADAProfileClean3)
# Review unique values in TADA.ReplicateSampleID
unique(TADAProfileClean3b$TADA.ReplicateSampleID)
# Filter df to include only unique values that are paired replicate samples (parent-result and child-replicate).
# Exclude NA's
TADAProfileClean3b <- TADAProfileClean3b[!is.na(TADAProfileClean3b$TADA.ReplicateSampleID), ]
# Exclude orphans
TADAProfileClean3b <- dplyr::filter(TADAProfileClean3b, TADA.ReplicateSampleID != "Orphan")
# Review unique values in TADA.ReplicateSampleID
unique(TADAProfileClean3b$TADA.ReplicateSampleID)
Now, let’s remove QC samples/measurements from the dataframe.
# Remove flagged QC samples using the TADA_FindQCActivities function:
TADAProfileClean4 <- TADA_FindQCActivities(TADAProfileClean3,
clean = TRUE
)
## [1] "Quality control samples have been removed or were not present in the input dataframe. Returning dataframe with TADA.ActivityType.Flag column for tracking."
# regenerate table and pie chart
TADA_FieldValuesTable(TADAProfileClean4, "ActivityTypeCode")
## Value Count
## 1 Field Msr/Obs 20090
## 2 Sample-Routine 4448
TADA_FieldValuesPie(TADAProfileClean4, "ActivityTypeCode")
We’ve completed our review of the ActivityTypeCode.
Now, let’s move on to a different field and see if there are any values that we want to remove.
In this next example, there are multiple MeasureQualifierCode values to review.
TADA_FieldValuesPie(TADAProfileClean4, "MeasureQualifierCode")
MeasureQualifierCode definitions are available here.
In this example, we show how to use the function TADA_FlagMeasureQualifierCode to add MeasureQualifierCode definitions and flag and/or remove rows with specific codes under MeasureQualifierCode that are categorized as “SUSPECT”.
See ?TADA_FlagMeasureQualifierCode for more information.
# flag only
Review_TADAProfileClean4 <- TADA_FlagMeasureQualifierCode(TADAProfileClean4,
clean = FALSE,
flaggedonly = TRUE,
define = TRUE
)
# Review_TADAProfileClean4 is empty because we did not find any Suspect samples
TADAProfileClean4 <- TADA_FlagMeasureQualifierCode(TADAProfileClean4,
clean = TRUE
)
# regenerate table and pie chart
TADA_FieldValuesPie(TADAProfileClean4, field = "MeasureQualifierCode")
Censored data
Censored data are measurements for which the true value is not known, but we can estimate the value based on lower or upper detection conditions and limit types. TADA fills missing TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode values with values and units from TADA.DetectionQuantitationLimitMeasure.MeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode, respectively, using the TADA_IDCensoredData function.
- TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type. This function runs within the TADA_SimpleCensoredMethods function.
In other words, detection limit information is copied and pasted into the result value column when the original value is NA and detection limit information is available. The two columns TADA focuses on to define and flag censored data are ResultDetectionConditionText and DetectionQuantitationLimitTypeName.
The TADA package currently has functions that summarize censored data incidence in the dataset and perform simple substitutions of censored data values, including x times the detection limit and random selection of a value between 0 and the detection limit. The user may specify the methods used for non-detects and over-detects separately in the input to the TADA_SimpleCensoredMethods function.
All censored data functions depend first on the TADA_IDCensoredData utility function, which assigns a TADA.CensoredData.Flag to all data records and identifies over-detects from non-detects using the ResultDetectionConditionText and DetectionQuantitationLimitTypeName. This utility function is automatically run within the TADA_DataRetrieval function and produces the TADA.CensoredData.Flag column. All records receive one of the following classifications: - Uncensored - Not filled with detection limit value; a detection. - Non-Detect - Left-censored - Over-Detect - Right-censored - Other Condition/Limit Populated - detection condition or limit type are ambiguous or not associated with a lower/upper detection limit. - Conflict between Condition and Limit - detection condition and limit type for a single record do not agree, e.g. one suggests over-detect and the other suggests non-detect. - Detection condition or detection limit is not documented in TADA reference tables. - detection condition or limit type is not characterized in the TADA reference tables, which are based on WQX domain tables. - Detection condition is missing and required for censored data ID. - Result needs more information before being categorized.
The TADA_SimpleCensoredMethods function also adds a TADA.MeasureQualifierCode.Def column which contains the MeasureQualiferCode concatenated with the WQX definition for each qualifier code. This provides additional information to the user which may assist in deciding which records to retain for analysis.
The next step we take in this example is to perform simple conversions to the censored data in the dataset: we keep over-detects as is (no conversion made) and convert non-detect values to 0.5 times the detection limit (half the detection limit). Please review ?TADA_Stats and ?TADA_SimpleCensoredMethods for more information.
TADAProfileClean4 <- TADA_SimpleCensoredMethods(TADAProfileClean4,
nd_method = "multiplier",
nd_multiplier = 0.5,
od_method = "as-is",
od_multiplier = "null"
)
## [1] "TADA_IDCensoredData: There are 22 results in your data frame that are missing ResultDetectionConditionText. TADA requires BOTH ResultDetectionConditionText and DetectionQuantitationLimitTypeName fields to be populated in order to categorize censored data."
Next, review unique values within the TADA.CensoredData.Flag, DetectionQuantitationLimitTypeName, and ResultDetectionConditionText columns.
# review unique values
unique(TADAProfileClean4$TADA.CensoredData.Flag)
## [1] "Uncensored"
## [2] "Detection condition is missing and required for censored data ID."
## [3] "Non-Detect"
unique(TADAProfileClean4$DetectionQuantitationLimitTypeName)
## [1] NA "Method Detection Level"
## [3] "Practical Quantitation Limit" "Lower Reporting Limit"
## [5] "Upper Quantitation Limit"
unique(TADAProfileClean4$ResultDetectionConditionText)
## [1] NA "Present Below Quantification Limit"
## [3] "Not Detected" "Not Detected at Reporting Limit"
Also, review the TADA.ResultMeasureValueDataTypes.Flag to see if any NA’s or ND’s (non-detects) remain.
unique(TADAProfileClean4$TADA.ResultMeasureValueDataTypes.Flag)
## [1] "Numeric"
## [2] "Result Value/Unit Copied from Detection Limit"
## [3] "Result Value/Unit Estimated from Detection Limit"
## [4] "Text"
## [5] "NA - Not Available"
Count how many NA’s remain in the TADA.ResultMeasureValue.
## [1] 472
Filter down to only numeric data. Remove data where the TADA.ResultMeasureValueDataTypes.Flag = “Text” or “NA - Not Available”. You can also remove any columns not required for the TADA workflow that contain only NAs. The TADA_AutoFilter() function removes non-numeric data and optional columns containing only NAs.
# Removes rows where the result value is not numeric. Specifically, removes rows with "Text" or "NA - Not Available" in the TADA.ResultMeasureValueDataTypes.Flag column, or NA in the TADA.ResultMeasureValue column. Removes optional columns containing only NAs.
TADAProfileClean5 <- TADA_AutoFilter(TADAProfileClean4)
## [1] "TADA_Autofilter: removing columns not required for TADA workflow if they contain only NAs."
## [1] "The following column(s) were removed as they contained only NAs: ActivityDepthAltitudeReferencePointText, SampleAquifer, ResultWeightBasisText, ResultTemperatureBasisText, ResultParticleSizeBasisText, USGSPCode, BinaryObjectFileName, BinaryObjectFileTypeCode, LabSamplePreparationUrl, HorizontalAccuracyMeasure.MeasureValue, HorizontalAccuracyMeasure.MeasureUnitCode, VerticalMeasure.MeasureValue, VerticalMeasure.MeasureUnitCode, VerticalAccuracyMeasure.MeasureValue, VerticalAccuracyMeasure.MeasureUnitCode, VerticalCollectionMethodName, VerticalCoordinateReferenceSystemDatumName, FormationTypeText, ProjectMonitoringLocationWeightingUrl, DrainageAreaMeasure.MeasureValue, DrainageAreaMeasure.MeasureUnitCode, ContributingDrainageAreaMeasure.MeasureValue and ContributingDrainageAreaMeasure.MeasureUnitCode."
## [1] "TADA_Autofilter: checking required columns for non-NA values."
## [1] "TADA Required column(s) SampleTissueAnatomyName, ResultDepthAltitudeReferencePointText, ResultTimeBasisText, StatisticalBaseCode, ResultFileUrl, ResultAnalyticalMethod.MethodUrl, HydrologicCondition, HydrologicEvent, DataQuality.PrecisionValue, DataQuality.BiasValue, DataQuality.ConfidenceIntervalValue, SamplingDesignTypeCode, ResultLaboratoryCommentText, ProjectFileUrl, AquiferName, AquiferTypeName, LocalAqfrName, ConstructionDateText, WellDepthMeasure.MeasureValue, WellDepthMeasure.MeasureUnitCode, WellHoleDepthMeasure.MeasureValue and WellHoleDepthMeasure.MeasureUnitCode contain only NA values. This may impact other TADA functions."
## [1] "Function removed 472 results. These results are either text or NA and cannot be plotted or represent quality control activities (not routine samples or measurements)."
Double check to make sure no NA’s or ND’s remain.
unique(TADAProfileClean5$TADA.ResultMeasureValueDataTypes.Flag)
## [1] "Numeric"
## [2] "Result Value/Unit Copied from Detection Limit"
## [3] "Result Value/Unit Estimated from Detection Limit"
## [1] 0
Convert synonymous characteristic, fraction, speciation, and unit values to a consistent convention based on user-defined/TADA standards
The TADA_GetSynonymRef function generates a synonym reference table that is specific to the input dataframe. Users can review how their input data relates to standard TADA values for the following elements:
TADA.CharacteristicName
TADA.ResultSampleFractionText
TADA.MethodSpeciationName
TADA.ResultMeasure.MeasureUnitCode
Users can also edit the reference file to meet their needs if desired. The download argument can be used to save the harmonization file to your current working directory when download = TRUE, the default is download = FALSE.
The TADA_HarmonizeSynonyms function then compares the input dataframe to the TADA Synonym Reference Table and makes conversions where target characteristics/fractions/speciations/units are provided. This function also appends a column called TADA.Harmonized.Flag, indicating which results had metadata changed/converted in this function. The purpose of this function is to make similar data consistent and therefore easier to compare and analyze.
Here are some examples of how the TADA_HarmonizeSynonyms function can be used:
TADA.ResultSampleFractionText specifies forms of constituents. In some cases, a single TADA.CharacteristicName will have both “Total” and “Dissolved” forms specified, which should not be combined. In these cases, each TADA.CharacteristicName and TADA.ResultSampleFractionText combination is given a different identifier. This identifier can be used later on to identify comparable data groups for calculating statistics and creating figures for each combination.
Some variables have different names but represent the same constituent (e.g., “Total Kjeldahl nitrogen (Organic N & NH3)” and “Kjeldahl nitrogen”). The TADA_HarmonizeSynonyms function gives a consistent name (and identifier) to synonyms.
UniqueHarmonizationRef <- TADA_GetSynonymRef(TADAProfileClean5)
TADAProfileClean5 <- TADA_HarmonizeSynonyms(TADAProfileClean5,
ref = UniqueHarmonizationRef
)
Total Nitrogen and Total Phosphorus Calculations
This section covers summing nutrient subspecies to estimate total nitrogen and phosphorus. This can be a challenging endeavor because some subspecies/compounds overlap in total nutrient calculations. Thus, TADA_CalculateTotalNP uses the Nutrient Aggregation logic to add together specific subspecies to obtain a total. TADA adds one more equation to the mix: total particulate nitrogen + total dissolved nitrogen. The function uses as many subspecies as possible to calculate a total for each given site, date, and depth group, but it will estimate total nitrogen with whatever subspecies are present. This function creates NEW total nutrient measurements (total nitrogen unfiltered as N and total phosphorus unfiltered as P) and adds them to the dataframe.
Users can use the default summation worksheet (see TADA_GetNutrientSummationRef) or customize it to suit their needs. The function also requires a daily aggregation value, either minimum, maximum, or mean. The default is ‘max’, which means that if multiple measurements of the same subspecies-fraction-speciation-unit occur on the same day at the same site and depth, the function will pick the maximum value to use in summation calculations.
TADAProfileClean6 <- TADA_CalculateTotalNP(TADAProfileClean5, daily_agg = "max")
## [1] "Aggregation results:"
##
## No aggregation needed Selected as max aggregate value
## 20122 1019
Parameter Level Filtering
In this section, you can select a single parameter, and review the unique values in specified fields. You may then choose to remove particular values by filtering.
To start, review the list of parameters in the dataframe using the TADA_FieldValuesTable function.
Enter ?TADA_FieldValuesTable into the console for more information.
TADA_FieldValuesTable(TADAProfileClean6, field = "TADA.CharacteristicName")
## Value Count
## 1 DISSOLVED OXYGEN (DO) 3580
## 2 PH 3580
## 3 TEMPERATURE 3579
## 4 CONDUCTIVITY 2723
## 5 TURBIDITY 1228
## 6 TOTAL PHOSPHORUS, MIXED FORMS 1190
## 7 DISSOLVED OXYGEN SATURATION 729
## 8 TOTAL DISSOLVED SOLIDS 724
## 9 SPECIFIC CONDUCTANCE 690
## 10 TOTAL NITROGEN, MIXED FORMS 596
## 11 DEPTH 481
## 12 BAROMETRIC PRESSURE 391
## 13 SALINITY 379
## 14 FLOW 363
## 15 DEPTH, SECCHI DISK DEPTH 331
## 16 TOTAL KJELDAHL NITROGEN (ORGANIC N & NH3) 323
## 17 STREAM STAGE 318
## 18 AMMONIA 311
## 19 NITRATE 247
## 20 CHLOROPHYLL A, CORRECTED FOR PHEOPHYTIN 236
## 21 CHLORIDE 223
## 22 ORTHOPHOSPHATE 216
## 23 TRANSPARENCY, SECCHI TUBE WITH DISK 212
## 24 ESCHERICHIA COLI 188
## 25 NITRATE + NITRITE 156
## 26 TOTAL SUSPENDED SOLIDS 148
## 27 CONDITION CLASS (DISSOLVED OXYGEN (DO)) 130
## 28 TURBIDITY FIELD 129
## 29 SULFATE 125
## 30 TEMPERATURE, SAMPLE 124
## 31 NITRITE 120
## 32 HARDNESS, CA, MG 106
## 33 AMMONIUM 93
## 34 DISSOLVED OXYGEN UPTAKE 85
## 35 COPPER 60
## 36 APPARENT COLOR 57
## 37 CHROMIUM 54
## 38 MERCURY 52
## 39 PHEOPHYTIN A 52
## 40 CHLOROPHYLL A 49
## 41 SELENIUM 43
## 42 VOLATILE SUSPENDED SOLIDS 38
## 43 ARSENIC 37
## 44 CADMIUM 37
## 45 LEAD 37
## 46 NICKEL 37
## 47 ZINC 37
## 48 DEPTH, SNOW COVER 33
## 49 HARDNESS, CARBONATE 29
## 50 IRON 29
## 51 ICE THICKNESS 28
## 52 BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS 23
## 53 CHEMICAL OXYGEN DEMAND 23
## 54 FLUORIDE 23
## 55 SILICON 23
## 56 TOTAL HARDNESS 23
## 57 COUNT 20
## 58 MANGANESE 19
## 59 ALPHA PARTICLE 12
## 60 RADIUM-226 10
## 61 RADIUM-228 10
## 62 TRITIUM 10
## 63 PERIPHYTON 7
## 64 .ALPHA.-ENDOSULFAN 6
## 65 .BETA.-ENDOSULFAN 6
## 66 .BETA.-HEXACHLOROCYCLOHEXANE 6
## 67 .DELTA.-HEXACHLOROCYCLOHEXANE 6
## 68 ALDRIN 6
## 69 ALUMINUM 6
## 70 BENZENE HEXACHLORIDE, ALPHA (BHC) 6
## 71 BERYLLIUM 6
## 72 BHC, .BETA.-BHC & .GAMMA.-BHC MIX, UNSPECIFIED 6
## 73 BORON 6
## 74 CALCIUM 6
## 75 CHLORDANE 6
## 76 COBALT 6
## 77 DIELDRIN 6
## 78 ENDOSULFAN SULFATE 6
## 79 ENDRIN 6
## 80 ENDRIN ALDEHYDE 6
## 81 HEPTACHLOR 6
## 82 HEPTACHLOR EPOXIDE 6
## 83 MAGNESIUM 6
## 84 METHOXYCHLOR 6
## 85 MOLYBDENUM 6
## 86 P,P'-DDD 6
## 87 P,P'-DDE 6
## 88 P,P'-DDT 6
## 89 POTASSIUM 6
## 90 SILVER 6
## 91 SODIUM 6
## 92 THALLIUM 6
## 93 TOXAPHENE 6
## 94 URANIUM 6
## 95 VANADIUM 6
## 96 ORGANIC CARBON 4
## 97 BARIUM 3
## 98 KJELDAHL NITROGEN 1
Next, we can revisit the TADA_FieldCounts function at the characteristic level to review how many unique allowable values are included within each of the following fields:
ActivityCommentText
ActivityTypeCode
TADA.ActivityMediaName
ActivityMediaSubdivisionName
MeasureQualifierCode
MonitoringLocationTypeName
HydrologicCondition
HydrologicEvent
ResultStatusIdentifier
MethodQualifierTypeName
ResultCommentText
ResultLaboratoryCommentText
TADA.ResultMeasure.MeasureUnitCode
TADA.ResultSampleFractionText
ResultTemperatureBasisText
ResultValueTypeName
ResultWeightBasisText
SampleCollectionEquipmentName
LaboratoryName
MethodDescriptionText
ResultParticleSizeBasisText
SampleCollectionMethod.MethodIdentifier
SampleCollectionMethod.MethodIdentifierContext
SampleCollectionMethod.MethodName
DataQuality.BiasValue
MethodSpeciationName
ResultAnalyticalMethod.MethodName
ResultAnalyticalMethod.MethodIdentifier
ResultAnalyticalMethod.MethodIdentifierContext
AssemblageSampledName
DetectionQuantitationLimitTypeName
TADA_FieldCounts(TADAProfileClean6, display = "most", characteristicName = "TOTAL PHOSPHORUS, MIXED FORMS")
## Fields Count
## 1 MonitoringLocationIdentifier 134
## 2 MonitoringLocationName 134
## 3 TADA.LongitudeMeasure 127
## 4 TADA.LatitudeMeasure 124
## 5 TADA.MonitoringLocationIdentifier 122
## 6 ResultDetectionQuantitationLimitUrl 108
## 7 ProjectIdentifier 12
## 8 ProjectName 12
## 9 HUCEightDigitCode 9
## 10 ResultCommentText 8
## 11 CountyCode 8
## 12 SampleCollectionMethod.MethodIdentifier 7
## 13 SampleCollectionMethod.MethodName 7
## 14 SampleCollectionMethod.MethodDescriptionText 6
## 15 OrganizationIdentifier 5
## 16 OrganizationFormalName 5
## 17 SampleCollectionMethod.MethodIdentifierContext 5
## 18 SampleCollectionEquipmentName 5
## 19 ProjectDescriptionText 5
## 20 ResultAnalyticalMethod.MethodIdentifier 4
## 21 ResultAnalyticalMethod.MethodName 4
## 22 LaboratoryName 4
## 23 DetectionQuantitationLimitTypeName 4
## 24 StateCode 4
## 25 ActivityRelativeDepthName 3
## 26 ResultStatusIdentifier 3
## 27 ResultAnalyticalMethod.MethodIdentifierContext 3
## 28 ResultAnalyticalMethod.MethodDescriptionText 3
## 29 MonitoringLocationTypeName 3
## 30 MonitoringLocationDescriptionText 3
## 31 HorizontalCoordinateReferenceSystemDatumName 3
## 32 QAPPApprovalAgencyName 3
## 33 ActivityTypeCode 2
## 34 ActivityMediaSubdivisionName 2
## 35 ResultDetectionConditionText 2
## 36 ResultValueTypeName 2
## 37 ProviderName 2
## 38 HorizontalCollectionMethodName 2
## 39 CountryCode 2
## 40 TADA.CharacteristicNameAssumptions 2
## 41 TADA.MeasureQualifierCode.Def 2
## 42 TADA.ActivityMediaName 1
## 43 TADA.CharacteristicName 1
## 44 TADA.MethodSpeciationName 1
## 45 TADA.ResultSampleFractionText 1
## 46 TADA.ComparableDataIdentifier 1
Selecting a parameter generates the list above, which is subset by the selected parameter. The list includes fields you may want to review, and the number of unique values in each field.
Next, choose a field from the list.
Review the WQX domain files for definitions: https://www.epa.gov/waterdata/storage-and-retrieval-and-water-quality-exchange-domain-services-and-downloads
Now, we’ll use TADA_FieldValuesTable and TADA_FieldValuesPie at the characteristic-level to review a column of interest.
# In this example we review values from the SampleCollectionMethod.MethodName field
TADA_FieldValuesTable(TADAProfileClean6, field = "SampleCollectionMethod.MethodName", characteristicName = "TOTAL PHOSPHORUS, MIXED FORMS")
## Value Count
## 1 Integrated Sampler 296
## 2 FDL QAPP 105
## 3 Standard Sampling Methods 94
## 4 PUEBLOOFTESUQUE_QAPP 64
## 5 Hand Dipper 30
## 6 Grab 1
TADA_FieldValuesPie(TADAProfileClean6, field = "SampleCollectionMethod.MethodName", characteristicName = "TOTAL PHOSPHORUS, MIXED FORMS")
Generate a scatterplot with two
# review unique identifiers
unique(TADAProfileClean6$TADA.ComparableDataIdentifier)
## [1] "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"
## [2] "PH_NA_NA_NA"
## [3] "TEMPERATURE_NA_NA_DEG C"
## [4] "TURBIDITY_NA_NA_NTU"
## [5] "AMMONIUM_UNFILTERED_AS N_MG/L"
## [6] "NITRATE_UNFILTERED_NA_MG/L"
## [7] "CHLORIDE_TOTAL_NA_UG/L"
## [8] "FLOW_NA_NA_CFS"
## [9] "ESCHERICHIA COLI_NA_NA_CFU/100ML"
## [10] "ESCHERICHIA COLI_NA_NA_MPN"
## [11] "NITRITE_UNFILTERED_NA_MG/L"
## [12] "ORTHOPHOSPHATE_UNFILTERED_AS P_UG/L"
## [13] "DEPTH_NA_NA_M"
## [14] "DISSOLVED OXYGEN SATURATION_NA_NA_%"
## [15] "SPECIFIC CONDUCTANCE_NA_NA_US/CM"
## [16] "TOTAL DISSOLVED SOLIDS_NA_NA_UG/L"
## [17] "SALINITY_NA_NA_PSS"
## [18] "SULFATE_TOTAL_NA_UG/L"
## [19] "AMMONIA_UNFILTERED_AS N_MG/L"
## [20] "BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS_TOTAL_NA_UG/L"
## [21] "FLUORIDE_TOTAL_NA_UG/L"
## [22] "CHROMIUM_TOTAL_NA_UG/L"
## [23] "COPPER_TOTAL_NA_UG/L"
## [24] "IRON_TOTAL_NA_UG/L"
## [25] "MANGANESE_TOTAL_NA_UG/L"
## [26] "SILICON_TOTAL_NA_UG/L"
## [27] "HARDNESS, CARBONATE_TOTAL_NA_UG/L"
## [28] "TOTAL HARDNESS_TOTAL_NA_MG/L"
## [29] "CHEMICAL OXYGEN DEMAND_TOTAL_NA_UG/L"
## [30] "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L"
## [31] "CHLOROPHYLL A, CORRECTED FOR PHEOPHYTIN_SUSPENDED_NA_UG/L"
## [32] "CONDUCTIVITY_NA_NA_US/CM"
## [33] "DEPTH, SECCHI DISK DEPTH_NA_NA_M"
## [34] "STREAM STAGE_NA_NA_M"
## [35] "TRANSPARENCY, SECCHI TUBE WITH DISK_NA_NA_IN"
## [36] "ICE THICKNESS_NA_NA_IN"
## [37] "DEPTH, SNOW COVER_NA_NA_IN"
## [38] "ORTHOPHOSPHATE_FILTERED_AS P_UG/L"
## [39] "BAROMETRIC PRESSURE_NA_NA_G/M2"
## [40] "TOTAL DISSOLVED SOLIDS_TOTAL_NA_UG/L"
## [41] "TOTAL KJELDAHL NITROGEN (ORGANIC N & NH3)_UNFILTERED_NA_MG/L"
## [42] "NITRITE_UNFILTERED_AS N_MG/L"
## [43] "NITRATE_UNFILTERED_AS N_MG/L"
## [44] "KJELDAHL NITROGEN_TOTAL RECOVERABLE_NA_MG/L"
## [45] "CHLOROPHYLL A_NA_NA_UG/L"
## [46] "COUNT_NA_NA_COUNT"
## [47] "ARSENIC_DISSOLVED_NA_UG/L"
## [48] "LEAD_DISSOLVED_NA_UG/L"
## [49] "SELENIUM_TOTAL_NA_UG/L"
## [50] "THALLIUM_DISSOLVED_NA_UG/L"
## [51] "URANIUM_DISSOLVED_NA_UG/L"
## [52] "ALUMINUM_DISSOLVED_NA_UG/L"
## [53] "BERYLLIUM_DISSOLVED_NA_UG/L"
## [54] "BORON_DISSOLVED_NA_UG/L"
## [55] "CADMIUM_DISSOLVED_NA_UG/L"
## [56] "COBALT_DISSOLVED_NA_UG/L"
## [57] "COPPER_DISSOLVED_NA_UG/L"
## [58] "IRON_DISSOLVED_NA_UG/L"
## [59] "MAGNESIUM_DISSOLVED_NA_UG/L"
## [60] "MOLYBDENUM_DISSOLVED_NA_UG/L"
## [61] "NICKEL_DISSOLVED_NA_UG/L"
## [62] "POTASSIUM_DISSOLVED_NA_UG/L"
## [63] "SILVER_DISSOLVED_NA_UG/L"
## [64] "SODIUM_DISSOLVED_NA_UG/L"
## [65] "VANADIUM_DISSOLVED_NA_UG/L"
## [66] "CALCIUM_DISSOLVED_NA_UG/L"
## [67] "ZINC_DISSOLVED_NA_UG/L"
## [68] "HARDNESS, CARBONATE_NA_AS CACO3_UG/L"
## [69] "MERCURY_TOTAL_NA_UG/L"
## [70] "P,P'-DDD_TOTAL_NA_UG/L"
## [71] "P,P'-DDE_TOTAL_NA_UG/L"
## [72] "P,P'-DDT_TOTAL_NA_UG/L"
## [73] "ALDRIN_TOTAL_NA_UG/L"
## [74] "BENZENE HEXACHLORIDE, ALPHA (BHC)_TOTAL_NA_UG/L"
## [75] ".BETA.-HEXACHLOROCYCLOHEXANE_TOTAL_NA_UG/L"
## [76] "CHLORDANE_TOTAL_NA_UG/L"
## [77] ".DELTA.-HEXACHLOROCYCLOHEXANE_TOTAL_NA_UG/L"
## [78] "DIELDRIN_TOTAL_NA_UG/L"
## [79] ".ALPHA.-ENDOSULFAN_TOTAL_NA_UG/L"
## [80] ".BETA.-ENDOSULFAN_TOTAL_NA_UG/L"
## [81] "ENDOSULFAN SULFATE_TOTAL_NA_UG/L"
## [82] "ENDRIN_TOTAL_NA_UG/L"
## [83] "ENDRIN ALDEHYDE_TOTAL_NA_UG/L"
## [84] "BHC, .BETA.-BHC & .GAMMA.-BHC MIX, UNSPECIFIED_TOTAL_NA_UG/L"
## [85] "HEPTACHLOR_TOTAL_NA_UG/L"
## [86] "HEPTACHLOR EPOXIDE_TOTAL_NA_UG/L"
## [87] "METHOXYCHLOR_TOTAL_NA_UG/L"
## [88] "TOXAPHENE_TOTAL_NA_UG/L"
## [89] "ALPHA PARTICLE_NA_NA_PCI/L"
## [90] "RADIUM-226_NA_NA_PCI/L"
## [91] "RADIUM-228_NA_NA_PCI/L"
## [92] "TRITIUM_NA_NA_PCI/L"
## [93] "BARIUM_DISSOLVED_NA_UG/L"
## [94] "MANGANESE_DISSOLVED_NA_UG/L"
## [95] "NITRATE + NITRITE_INORGANIC_AS NO3_MG/L"
## [96] "TOTAL KJELDAHL NITROGEN (ORGANIC N & NH3)_UNFILTERED_AS N_MG/L"
## [97] "NITRATE + NITRITE_FILTERED_AS N_MG/L"
## [98] "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"
## [99] "AMMONIA_FILTERED_AS N_MG/L"
## [100] "TOTAL SUSPENDED SOLIDS_NON-FILTERABLE (PARTICLE)_NA_UG/L"
## [101] "ORGANIC CARBON_DISSOLVED_NA_UG/L"
## [102] "VOLATILE SUSPENDED SOLIDS_TOTAL_NA_UG/L"
## [103] "CONDITION CLASS (DISSOLVED OXYGEN (DO))_NA_NA_%"
## [104] "TEMPERATURE, SAMPLE_NA_NA_DEG C"
## [105] "DISSOLVED OXYGEN UPTAKE_NA_NA_UG/L"
## [106] "TURBIDITY FIELD_NA_NA_NTU"
## [107] "PHEOPHYTIN A_TOTAL_NA_UG/L"
## [108] "NICKEL_TOTAL_NA_UG/L"
## [109] "LEAD_TOTAL_NA_UG/L"
## [110] "CHLOROPHYLL A_UNFILTERED_NA_UG/L"
## [111] "CADMIUM_TOTAL_NA_UG/L"
## [112] "ARSENIC_TOTAL_NA_UG/L"
## [113] "ZINC_TOTAL_NA_UG/L"
## [114] "HARDNESS, CA, MG_TOTAL_NA_MG/L"
## [115] "ORTHOPHOSPHATE_TOTAL RECOVERABLE_AS P_UG/L"
## [116] "APPARENT COLOR_TOTAL_NA_PCU"
## [117] "MERCURY_DISSOLVED_NA_UG/L"
## [118] "PERIPHYTON_NA_NA_G/M2"
# choose two and generate scatterplot
TADA_TwoCharacteristicScatterplot(TADAProfileClean6, id_cols = "TADA.ComparableDataIdentifier", groups = c("TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L", "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L"))
Now we will summarize results for a single comparable data group using the TADA.ComparableDataIdentifier (i.e., comparable characteristic, unit, speciation, and fraction combination) using TADA_Histogram and TADA_Boxplot. Note that users may generate a list output of multiple plots if their input dataset has more than one unique comparable data group.
# review TADA.ComparableDataIdentifier
unique(TADAProfileClean5$TADA.ComparableDataIdentifier)
## [1] "DISSOLVED OXYGEN (DO)_NA_NA_MG/L"
## [2] "PH_NA_NA_NA"
## [3] "TEMPERATURE_NA_NA_DEG C"
## [4] "TURBIDITY_NA_NA_NTU"
## [5] "AMMONIUM_UNFILTERED_AS N_MG/L"
## [6] "NITRATE_UNFILTERED_NA_MG/L"
## [7] "CHLORIDE_TOTAL_NA_UG/L"
## [8] "FLOW_NA_NA_CFS"
## [9] "ESCHERICHIA COLI_NA_NA_CFU/100ML"
## [10] "ESCHERICHIA COLI_NA_NA_MPN"
## [11] "NITRITE_UNFILTERED_NA_MG/L"
## [12] "ORTHOPHOSPHATE_UNFILTERED_AS P_UG/L"
## [13] "DEPTH_NA_NA_M"
## [14] "DISSOLVED OXYGEN SATURATION_NA_NA_%"
## [15] "SPECIFIC CONDUCTANCE_NA_NA_US/CM"
## [16] "TOTAL DISSOLVED SOLIDS_NA_NA_UG/L"
## [17] "SALINITY_NA_NA_PSS"
## [18] "SULFATE_TOTAL_NA_UG/L"
## [19] "AMMONIA_UNFILTERED_AS N_MG/L"
## [20] "BIOCHEMICAL OXYGEN DEMAND, STANDARD CONDITIONS_TOTAL_NA_UG/L"
## [21] "FLUORIDE_TOTAL_NA_UG/L"
## [22] "CHROMIUM_TOTAL_NA_UG/L"
## [23] "COPPER_TOTAL_NA_UG/L"
## [24] "IRON_TOTAL_NA_UG/L"
## [25] "MANGANESE_TOTAL_NA_UG/L"
## [26] "SILICON_TOTAL_NA_UG/L"
## [27] "HARDNESS, CARBONATE_TOTAL_NA_UG/L"
## [28] "TOTAL HARDNESS_TOTAL_NA_MG/L"
## [29] "CHEMICAL OXYGEN DEMAND_TOTAL_NA_UG/L"
## [30] "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L"
## [31] "CHLOROPHYLL A, CORRECTED FOR PHEOPHYTIN_SUSPENDED_NA_UG/L"
## [32] "CONDUCTIVITY_NA_NA_US/CM"
## [33] "DEPTH, SECCHI DISK DEPTH_NA_NA_M"
## [34] "STREAM STAGE_NA_NA_M"
## [35] "TRANSPARENCY, SECCHI TUBE WITH DISK_NA_NA_IN"
## [36] "ICE THICKNESS_NA_NA_IN"
## [37] "DEPTH, SNOW COVER_NA_NA_IN"
## [38] "ORTHOPHOSPHATE_FILTERED_AS P_UG/L"
## [39] "BAROMETRIC PRESSURE_NA_NA_G/M2"
## [40] "TOTAL DISSOLVED SOLIDS_TOTAL_NA_UG/L"
## [41] "TOTAL KJELDAHL NITROGEN (ORGANIC N & NH3)_UNFILTERED_NA_MG/L"
## [42] "NITRITE_UNFILTERED_AS N_MG/L"
## [43] "NITRATE_UNFILTERED_AS N_MG/L"
## [44] "KJELDAHL NITROGEN_TOTAL RECOVERABLE_NA_MG/L"
## [45] "CHLOROPHYLL A_NA_NA_UG/L"
## [46] "COUNT_NA_NA_COUNT"
## [47] "ARSENIC_DISSOLVED_NA_UG/L"
## [48] "LEAD_DISSOLVED_NA_UG/L"
## [49] "SELENIUM_TOTAL_NA_UG/L"
## [50] "THALLIUM_DISSOLVED_NA_UG/L"
## [51] "URANIUM_DISSOLVED_NA_UG/L"
## [52] "ALUMINUM_DISSOLVED_NA_UG/L"
## [53] "BERYLLIUM_DISSOLVED_NA_UG/L"
## [54] "BORON_DISSOLVED_NA_UG/L"
## [55] "CADMIUM_DISSOLVED_NA_UG/L"
## [56] "COBALT_DISSOLVED_NA_UG/L"
## [57] "COPPER_DISSOLVED_NA_UG/L"
## [58] "IRON_DISSOLVED_NA_UG/L"
## [59] "MAGNESIUM_DISSOLVED_NA_UG/L"
## [60] "MOLYBDENUM_DISSOLVED_NA_UG/L"
## [61] "NICKEL_DISSOLVED_NA_UG/L"
## [62] "POTASSIUM_DISSOLVED_NA_UG/L"
## [63] "SILVER_DISSOLVED_NA_UG/L"
## [64] "SODIUM_DISSOLVED_NA_UG/L"
## [65] "VANADIUM_DISSOLVED_NA_UG/L"
## [66] "CALCIUM_DISSOLVED_NA_UG/L"
## [67] "ZINC_DISSOLVED_NA_UG/L"
## [68] "HARDNESS, CARBONATE_NA_AS CACO3_UG/L"
## [69] "MERCURY_TOTAL_NA_UG/L"
## [70] "P,P'-DDD_TOTAL_NA_UG/L"
## [71] "P,P'-DDE_TOTAL_NA_UG/L"
## [72] "P,P'-DDT_TOTAL_NA_UG/L"
## [73] "ALDRIN_TOTAL_NA_UG/L"
## [74] "BENZENE HEXACHLORIDE, ALPHA (BHC)_TOTAL_NA_UG/L"
## [75] ".BETA.-HEXACHLOROCYCLOHEXANE_TOTAL_NA_UG/L"
## [76] "CHLORDANE_TOTAL_NA_UG/L"
## [77] ".DELTA.-HEXACHLOROCYCLOHEXANE_TOTAL_NA_UG/L"
## [78] "DIELDRIN_TOTAL_NA_UG/L"
## [79] ".ALPHA.-ENDOSULFAN_TOTAL_NA_UG/L"
## [80] ".BETA.-ENDOSULFAN_TOTAL_NA_UG/L"
## [81] "ENDOSULFAN SULFATE_TOTAL_NA_UG/L"
## [82] "ENDRIN_TOTAL_NA_UG/L"
## [83] "ENDRIN ALDEHYDE_TOTAL_NA_UG/L"
## [84] "BHC, .BETA.-BHC & .GAMMA.-BHC MIX, UNSPECIFIED_TOTAL_NA_UG/L"
## [85] "HEPTACHLOR_TOTAL_NA_UG/L"
## [86] "HEPTACHLOR EPOXIDE_TOTAL_NA_UG/L"
## [87] "METHOXYCHLOR_TOTAL_NA_UG/L"
## [88] "TOXAPHENE_TOTAL_NA_UG/L"
## [89] "ALPHA PARTICLE_NA_NA_PCI/L"
## [90] "RADIUM-226_NA_NA_PCI/L"
## [91] "RADIUM-228_NA_NA_PCI/L"
## [92] "TRITIUM_NA_NA_PCI/L"
## [93] "BARIUM_DISSOLVED_NA_UG/L"
## [94] "MANGANESE_DISSOLVED_NA_UG/L"
## [95] "NITRATE + NITRITE_INORGANIC_AS NO3_MG/L"
## [96] "TOTAL KJELDAHL NITROGEN (ORGANIC N & NH3)_UNFILTERED_AS N_MG/L"
## [97] "NITRATE + NITRITE_FILTERED_AS N_MG/L"
## [98] "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"
## [99] "AMMONIA_FILTERED_AS N_MG/L"
## [100] "TOTAL SUSPENDED SOLIDS_NON-FILTERABLE (PARTICLE)_NA_UG/L"
## [101] "ORGANIC CARBON_DISSOLVED_NA_UG/L"
## [102] "VOLATILE SUSPENDED SOLIDS_TOTAL_NA_UG/L"
## [103] "CONDITION CLASS (DISSOLVED OXYGEN (DO))_NA_NA_%"
## [104] "TEMPERATURE, SAMPLE_NA_NA_DEG C"
## [105] "DISSOLVED OXYGEN UPTAKE_NA_NA_UG/L"
## [106] "TURBIDITY FIELD_NA_NA_NTU"
## [107] "PHEOPHYTIN A_TOTAL_NA_UG/L"
## [108] "NICKEL_TOTAL_NA_UG/L"
## [109] "LEAD_TOTAL_NA_UG/L"
## [110] "CHLOROPHYLL A_UNFILTERED_NA_UG/L"
## [111] "CADMIUM_TOTAL_NA_UG/L"
## [112] "ARSENIC_TOTAL_NA_UG/L"
## [113] "ZINC_TOTAL_NA_UG/L"
## [114] "HARDNESS, CA, MG_TOTAL_NA_MG/L"
## [115] "ORTHOPHOSPHATE_TOTAL RECOVERABLE_AS P_UG/L"
## [116] "APPARENT COLOR_TOTAL_NA_PCU"
## [117] "MERCURY_DISSOLVED_NA_UG/L"
## [118] "PERIPHYTON_NA_NA_G/M2"
# filter dataframe to only "TOTAL PHOSPHORUS, MIXED FORMS"
TADAProfileCleanTP <- dplyr::filter(TADAProfileClean6, TADA.ComparableDataIdentifier == "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L")
# generate stats table
TADAProfileCleanTP_stats <- TADA_Stats(TADAProfileCleanTP)
## [1] "Note: Your dataset contains TADA-generated total nutrient results, which have fewer columns populated with metadata. This might affect how groups are displayed in the stats table."
TADAProfileCleanTP_stats
## # A tibble: 1 × 23
## TADA.ComparableDataIdentif…¹ Location_Count Measurement_Count Non_Detect_Count
## <chr> <int> <int> <int>
## 1 TOTAL PHOSPHORUS, MIXED FOR… 134 1190 1
## # ℹ abbreviated name: ¹TADA.ComparableDataIdentifier
## # ℹ 19 more variables: Non_Detect_Pct <dbl>, Non_Detect_Lvls <int>,
## # Over_Detect_Count <int>, Over_Detect_Pct <dbl>, UpperFence <dbl>,
## # LowerFence <dbl>, Min <dbl>, Mean <dbl>, Max <dbl>, Percentile_5th <dbl>,
## # Percentile_10th <dbl>, Percentile_15th <dbl>, Percentile_25th <dbl>,
## # Percentile_50th_Median <dbl>, Percentile_75th <dbl>, Percentile_85th <dbl>,
## # Percentile_95th <dbl>, Percentile_98th <dbl>, ND_Estimation_Method <chr>
# generate a histogram
TP_Histogram <- TADA_Histogram(TADAProfileCleanTP, id_cols = "TADA.ComparableDataIdentifier")
# view histogram
TP_Histogram
Generate interactive box plot.
TP_Boxplot <- TADA_Boxplot(TADAProfileCleanTP, id_cols = "TADA.ComparableDataIdentifier")
TP_Boxplot
Generate interactive scatterplot.
TADAProfileCleanTP_dailymax <- TADA_AggregateMeasurements(TADAProfileCleanTP,
agg_fun = c("max"),
clean = TRUE
)
## [1] "Aggregation results:"
##
## No aggregation needed Selected as max aggregate value
## 1056 67
TP_Scatterplot <- TADA_Scatterplot(TADAProfileCleanTP_dailymax, id_cols = "TADA.ComparableDataIdentifier")
TP_Scatterplot
Retain TADA Required Columns
Now we can review the “TADA” prefixed columns we have added to the data set. If we are satisfied with the conversions, filtering, flagging, etc. and the resulting “TADA” columns, we can use the TADA_RetainRequired function to remove any columns that are not required or used as filters in the TADA workflow. This reduces the size of the data frame.
TADAProfileClean7 <- TADA_RetainRequired(TADAProfileClean6)
## [1] "TADA_RetainRequired: removing columns not required for TADA workflow including original columns that have been replaced with TADA prefix duplicates."
## [1] "The following non-required columns were removed: ActivityEndDate, ActivityEndTime.Time, ActivityEndTime.TimeZoneCode, ActivityEndDateTime, ActivityConductingOrganizationText, ActivityLocation.LatitudeMeasure, ActivityLocation.LongitudeMeasure, AnalysisStartDate, ResultDetectionQuantitationLimitUrl, ActivityStartTime.TimeZoneCode_offset, ActivityEndTime.TimeZoneCode_offset, SourceMapScaleNumeric, HorizontalCollectionMethodName, ProviderName and LastUpdated."
TADA Shiny Application
Finally, take a look at an alternative workflow, TADA Shiny Module 1: Data Discovery and Cleaning. This is a Shiny application that runs many of the TADA functions covered in this document behind a graphical user interface. The shiny application queries the WQP, contains maps and data visualizations, flags suspect data results, handles censored data, and more. You can launch it using the code below.
DRAFT Module 1 is also currently hosted on the web with minimal server memory/storage allocated.
# download TADA Shiny repository
remotes::install_github("USEPA/TADAShiny",
ref = "develop",
dependencies = TRUE
)
# launch the app locally.
TADAShiny::run_app()
Download this Article from GitHub
Go to: https://github.com/USEPA/EPATADA/blob/develop/vignettes/TADAModule1.Rmd