03 - Clean

14 January 2015

In this third lesson we are going to start working on maninpulating and cleaning up our data frames. We are spending some time on this because, in my experience, most data analysis and statistics classes seem to assume that 95% of the time spent working with data is on the analysis and interpretation of that analysis and little time is spent getting data ready to analyze. However, in reality, I’d argue, the time spent is flipped with most time spent on cleaning up data and significantly less time on the analysis. We will just be scratching the surface of the many ways you can work with data in R. We will show the basics of subsetting, merging, modifying, and sumarizing data and our examples will all use Hadley Wickham and Romain Francois’ dplyr package. There are many ways to do this type of work in R, many of which are available from base R, but I heard from many (AED colleagues and Hadley himself!) focusing on one way to do this is best, so dplyr it is!

Before we jump into the lesson, quick links and lesson goals are:

##Quick Links to Exercises and R code - Lesson 3 R Code: All the code from this post in an R Script. - Exercise 1: Subsetting the NLA data with dplyr - Exercise 2: Merge two NLA data files together. - Exercise 3: Using dplyr to modify and summarize the NLA.

##Lesson Goals - Show and tell on using base R for data manipulation - Better understand data cleaning through use of dplyr - Use merge() to combine data frames by a common key. - Do some basic reshaping and summarizing data frames. - Know what pipes are and why you might want to use them

##What is dplyr?

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.

This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R. There is a lot of great info on dplyr. If you have an interest, i’d encourage you to look more. The vignettes are particularly good.

##Subsetting in base R In base R you can use a indexing to select out rows and columns. You will see this quite often in other peoples’ code, so I want to at least show it to you.

#Create a vector
x<-c(10:19)
x
##  [1] 10 11 12 13 14 15 16 17 18 19
#Positive indexing returns just the value in the ith place
x[7]
## [1] 16
#Negative indexing returns all values except the value in the ith place
x[-3]
## [1] 10 11 13 14 15 16 17 18 19
#Ranges work too
x[8:10]
## [1] 17 18 19
#A vector can be used to index
#Can be numeric
x[c(2,6,10)]
## [1] 11 15 19
#Can be boolean - will repeat the pattern 
x[c(TRUE,FALSE)]
## [1] 10 12 14 16 18
#Can even get fancy
x[x%%2==0]
## [1] 10 12 14 16 18

You can also index a data frame or select individual columns of a data frame. Since a data frame has two dimensions, you need to specify an index for both the row and the column. You can specify both and get a single value like data_frame[row,column],specify just the row and the get the whole row back like data_frame[row,] or get just the column with data_frame[,column]. These examples show that.

#Let's use one of the stock data frames in R, iris
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#And grab a specific value
iris[1,1]
## [1] 5.1
#A whole column
petal_len<-iris[,3]
petal_len
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
#A row
obs15<-iris[15,]
obs15
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 15          5.8           4          1.2         0.2  setosa
#Many rows
obs3to7<-iris[3:7,]
obs3to7
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 7          4.6         3.4          1.4         0.3  setosa

Also remember that data frames have column names. We can use those too. Let’s try it.

#First, there are a couple of ways to use the column names
iris$Petal.Length
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
head(iris["Petal.Length"])
##   Petal.Length
## 1          1.4
## 2          1.4
## 3          1.3
## 4          1.5
## 5          1.4
## 6          1.7
#Multiple colums
head(iris[c("Petal.Length","Species")])
##   Petal.Length Species
## 1          1.4  setosa
## 2          1.4  setosa
## 3          1.3  setosa
## 4          1.5  setosa
## 5          1.4  setosa
## 6          1.7  setosa
#Now we can combine what we have seen to do some more complex queries
#Lets get all the data for Species with a petal length greater than 6
big_iris<-iris[iris$Petal.Length>=6,]
head(big_iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 101          6.3         3.3          6.0         2.5 virginica
## 106          7.6         3.0          6.6         2.1 virginica
## 108          7.3         2.9          6.3         1.8 virginica
## 110          7.2         3.6          6.1         2.5 virginica
## 118          7.7         3.8          6.7         2.2 virginica
## 119          7.7         2.6          6.9         2.3 virginica
#Or maybe we want just the sepal widths of the virginica species
virginica_iris<-iris$Sepal.Width[iris$Species=="virginica"]
head(virginica_iris)
## [1] 3.3 2.7 3.0 2.9 3.0 3.0

##Data Manipulation in dplyr So, base R can do what you need, but it is a bit complicated and the syntax is a bit dense. In dplyr this can be done with two functions, select() and filter(). The code can be a bit more verbose, but it allows you to write code that is much more readable. Before we start we need to make sure we’ve got everything installed and loaded. If you do not have R Version 3.0.2 or greater you will have some problems (i.e. no dplyr for you).

install.packages("dplyr")
library("dplyr")

I am going to repeat some of what I showed above on data frames but now with dplyr. This is what we will be using in the exercises.

#First, select some columns
dplyr_sel<-select(iris,Sepal.Length,Petal.Length,Species)
#That's it.  Select one or many columns
#Now select some, like before
dplyr_big_iris<-filter(iris, Petal.Length>=6)
head(dplyr_big_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          7.6         3.0          6.6         2.1 virginica
## 3          7.3         2.9          6.3         1.8 virginica
## 4          7.2         3.6          6.1         2.5 virginica
## 5          7.7         3.8          6.7         2.2 virginica
## 6          7.7         2.6          6.9         2.3 virginica
#Or maybe we want just the virginica species
virginica_iris<-filter(iris,Species=="virginica")
head(virginica_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          5.8         2.7          5.1         1.9 virginica
## 3          7.1         3.0          5.9         2.1 virginica
## 4          6.3         2.9          5.6         1.8 virginica
## 5          6.5         3.0          5.8         2.2 virginica
## 6          7.6         3.0          6.6         2.1 virginica

But what if I wanted to select and filter? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes in the unix/linux world are not new and allow you to chain commands together where the output of one command is the input to the next. This provides a more natural way to read the commands in that they are executed in the way you conceptualize it and make the interpretation of the code a bit easier. Pipes in R look like %>% and are made available via th magrittr package, which is installed as part of dplyr. Let’s try all three with the same analysis: selecting out a subset of columns but for only a single species.

#Intermediate data frames
#Select First: note the order of the output, neat too!
dplyr_big_iris_tmp1<-select(iris,Species,Sepal.Length,Petal.Length)
dplyr_big_iris_tmp<-filter(dplyr_big_iris_tmp1,Petal.Length>=6)
head(dplyr_big_iris_tmp)
##     Species Sepal.Length Petal.Length
## 1 virginica          6.3          6.0
## 2 virginica          7.6          6.6
## 3 virginica          7.3          6.3
## 4 virginica          7.2          6.1
## 5 virginica          7.7          6.7
## 6 virginica          7.7          6.9
#Nested function
dplyr_big_iris_nest<-filter(select(iris,Species,Sepal.Length,Petal.Length),Species=="virginica")
head(dplyr_big_iris_nest)
##     Species Sepal.Length Petal.Length
## 1 virginica          6.3          6.0
## 2 virginica          5.8          5.1
## 3 virginica          7.1          5.9
## 4 virginica          6.3          5.6
## 5 virginica          6.5          5.8
## 6 virginica          7.6          6.6
#Pipes
dplyr_big_iris_pipe<-select(iris,Species,Sepal.Length,Petal.Length) %>%
  filter(Species=="virginica")
head(dplyr_big_iris_pipe)
##     Species Sepal.Length Petal.Length
## 1 virginica          6.3          6.0
## 2 virginica          5.8          5.1
## 3 virginica          7.1          5.9
## 4 virginica          6.3          5.6
## 5 virginica          6.5          5.8
## 6 virginica          7.6          6.6

##Exercise 1 This exercise is going to focus on using what we just covered on dplyr to start to clean up the National Lakes Assessment data files. Remember to use the stickies: green when you’re done, red if you have a problem.

  1. If it isn’t already open, make sure you have the script we created, “nla_analysis.R” opened up.
  2. Start a new section of code in this script by simply putting in a line or two of comments indicating what it is this set of code does.
  3. Our goal for this is to create two new data frames that represent a subset of the observations as well as a subset of the data.
  4. First, from the nla_sites data frame we want a new data frame that has only the following columns: SITE_ID, VISIT_NO, SITE_TYPE, LON_DD, LAT_DD, STATE_NAME, WSA_ECO9, NUT_REG, NUTREG_NAME, LAKE_ORIGIN, and RT_NLA. Name the new data frame nla_sites_subset. Think select()
  5. Next, lets subset the water quality data from nla_wq. The columns we want for this are: SITE_ID, VISIT_NO, SITE_TYPE, TURB, NTL, PTL, CHLA, and SECMEAN.
  6. Last thing we are going to need to do is get a subset of the observations from nla_wq_subset and nla_sites_subset. We need only the lakes with VISIT_NO equal to 1 and SITE_TYPE equal to “PROB_Lake”. Keep the same name, nla_wq_subset and nla_sites_subset, for these data frames.

##Merging Data Joining data in dplyr is accomplished via the various x_join() commands (e.g., inner_join, left_join, anti_join, etc). These are very SQL-esque so if you speak SQL (I am far from fluent!) then these will be pretty easy for you. If not then they aren’t immediately intutive. For our purposes, the base functions rbind() and merge() are more than adequate.

We are going to talk about several different ways to do this. First, let’s add some new rows to a data frame. This is very handy as you might have data collected and entered at one time, and then additional observations made later that need to be added. So with rbind() we can stack two data frames with the same columns to store more observations.

#Let's first create a new small example data.frame
rbind_df<-data.frame(a=1:3,b=c("a","b","c"),c=c(T,T,F),d=rnorm(3))
#Now an example df to add
rbind_df2<-data.frame(a=10:12,b=c("x","y","z"),c=c(F,F,F),d=rnorm(3))
rbind_df<-rbind(rbind_df, rbind_df2)
rbind_df
##    a b     c           d
## 1  1 a  TRUE  1.33148801
## 2  2 b  TRUE -0.25197716
## 3  3 c FALSE  1.57720988
## 4 10 x FALSE -0.19499561
## 5 11 y FALSE  1.53880986
## 6 12 z FALSE  0.07173521

Now something to think about. Could you add a vector as a new row? Why/Why not? When/When not?

Let’s go back to the columns now. There are simple ways to add columns of the same length with observations in the same order to a data frame, but it is very common to have to datasets that are in different orders and have differing numbers of rows. What we want to do in that case is going to be more of a database type function and join two tables based on a common column. A common way to do that in base R is with merge(). So let’s contrive another example by creating a dataset to merge to rbind_df that we created above.

# Contrived data frame
rbind_df_merge_me<-data.frame(a=c(1,3,10,11,14,6,23),x=rnorm(7),names=c("bob","joe","sue",NA,NA,"jeff",NA))
# Create merge of matches
rbind_df_merge_match<-merge(rbind_df,rbind_df_merge_me,by="a")
rbind_df_merge_match
##    a b     c          d         x names
## 1  1 a  TRUE  1.3314880 0.8904878   bob
## 2  3 c FALSE  1.5772099 1.8998132   joe
## 3 10 x FALSE -0.1949956 0.2495541   sue
## 4 11 y FALSE  1.5388099 1.7625538  <NA>
# Create merge of matches and all of the first data frame
rbind_df_merge_allx<-merge(rbind_df,rbind_df_merge_me,by="a",all.x=TRUE)
rbind_df_merge_allx
##    a b     c           d         x names
## 1  1 a  TRUE  1.33148801 0.8904878   bob
## 2  2 b  TRUE -0.25197716        NA  <NA>
## 3  3 c FALSE  1.57720988 1.8998132   joe
## 4 10 x FALSE -0.19499561 0.2495541   sue
## 5 11 y FALSE  1.53880986 1.7625538  <NA>
## 6 12 z FALSE  0.07173521        NA  <NA>

##Exercise 2 In this exercise we are going to practice merging our NLA data. You will remember that we have two datasets, one on water quality and one on site information. We selected some info out of the water quality data that we didn’t select out of the site info so we have two data frames, with differing numbers of rows and unknown order. Use your stickies!

  1. This is the only task we have for this exercise. Add to your script a line (or more if you need it) to create a new data frame, nla_data, that is a merge of nla_sites_subset and nla_wq_subset, but with all lines in nla_wq_subset preserved in the output. The column to merge on is “SITE_ID”
  2. This data frame will have some NA values. For the purposes of these lessons, it is better to remove these. Add another line to your code and create a data frame that replaces nla_data that has had all of the NA values removed.
  3. If that goes quickly, feel free to explore rbind() some.

##Modify and Summarize Now back to dplyr. One area where it really shines is in modifying and summarizing. We will do more here than we did with base, but first lets walk through one of the examples we did previously, aggregating. We can do this with group_by() and summarize().

First, we’ll look at an example of grouping a data frame and summarizing the data within those groups.

#Chained with Pipes
group_by(iris,Species)%>%
  summarize(mean(Sepal.Length),
            mean(Sepal.Width),
            mean(Petal.Length),
            mean(Petal.Width))
## Source: local data frame [3 x 5]
## 
##      Species mean(Sepal.Length) mean(Sepal.Width) mean(Petal.Length)
## 1     setosa              5.006             3.428              1.462
## 2 versicolor              5.936             2.770              4.260
## 3  virginica              6.588             2.974              5.552
## Variables not shown: mean(Petal.Width) (dbl)

There are many other functions in dplyr that are useful. Much of what they do, can certainly be accomplished with base R, but not quite as intuitively. Let’s run through some examples with arrange(), slice(), and mutate().

First arrange() will re-order a data frame based on the values of a columns. It will take multiple columns and can be in descending or ascending order. I think iris is getting a bit tired, let’s try a different stock data frame this time: mtcars. If you are interested you can try data() to see what is available.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#ascending order is default
head(arrange(mtcars,mpg))
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
#descending
head(arrange(mtcars,desc(mpg)))
##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 2 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 3 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 4 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 5 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 6 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#multiple columns: most cyl with best mpg at top
head(arrange(mtcars,desc(cyl),desc(mpg)))
##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 2 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 3 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 4 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 5 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 6 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2

Now slice() which accomplishes what we did with the numeric indices before. Remembering back to that, we’d could grab rows of the data frame with something like x[1:3,].

#grab rows 3 through 10
slice(mtcars,3:10)
##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 3 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 4 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 5 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 6 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 7 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 8 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

mutate() allows us to add new columns based on expressions applied to existing columns

head(mutate(mtcars,kml=mpg*0.425))
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb    kml
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 8.9250
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 8.9250
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 9.6900
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 9.0950
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 7.9475
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 7.6925

Lastly, one more function, rowwise() allows us run rowwise, operations. Let’s use a bit of a contrived example for this.

#First a dataset of temperatures, recorded weekly at 100 sites.
temp_df<-data.frame(id=1:100,week1=runif(100,20,25), week2=runif(100,19,24), 
                    week3=runif(100,18,26), week4=runif(100,17,23))
head(temp_df)
##   id    week1    week2    week3    week4
## 1  1 20.64609 22.71119 21.21706 17.67503
## 2  2 23.83892 20.51013 21.01253 19.28746
## 3  3 20.75985 22.31766 19.66867 20.40041
## 4  4 22.97963 20.04429 20.15280 22.23339
## 5  5 20.10866 21.42423 22.63455 21.89657
## 6  6 20.18410 22.72863 20.00610 17.21429
#To add row means to the dataset, without the ID
temp_df2<-temp_df %>% 
  rowwise() %>% 
  mutate(site_mean = mean(c(week1,week2,week3,week4)))
head(temp_df2)
## Source: local data frame [6 x 6]
## 
##   id    week1    week2    week3    week4 site_mean
## 1  1 20.64609 22.71119 21.21706 17.67503  20.56235
## 2  2 23.83892 20.51013 21.01253 19.28746  21.16226
## 3  3 20.75985 22.31766 19.66867 20.40041  20.78665
## 4  4 22.97963 20.04429 20.15280 22.23339  21.35253
## 5  5 20.10866 21.42423 22.63455 21.89657  21.51600
## 6  6 20.18410 22.72863 20.00610 17.21429  20.03328

We now have quite a few tools that we can use to clean and manipulate data in R. We have barely touched what both base R and dplyr are capable of accomplishing, but hopefully you now have some basics to build on. I personally think the database connection in dplyr are going to prove very useful.

Let’s practice some of these last functions with our NLA data.

##Exercise 3 Add a new section to our script to calculate the nla water quality means.

  1. Use nla_data that we created in the previous exercise.
  2. Add some lines to your script to calculate the mean by LAKE_ORIGIN, for TURB, NTL, PTL,CHLA, and SECMEAN. Save to a data frame named nla_wq_means_orig.
  3. Repeat the same analysis but for the WSA_ECO9 ecoregions. Save this to a data frame named nla_wq_means_eco.
  4. It might be interesting to compare the grouped means to the means of each value for the entire dataset. Using summarize(), calculate the mean wq for all lakes (hint: no groups!). Save this as nla_wq_means.