03 - Clean
14 January 2015
In this third lesson we are going to start working on maninpulating and cleaning up our data frames. We are spending some time on this because, in my experience, most data analysis and statistics classes seem to assume that 95% of the time spent working with data is on the analysis and interpretation of that analysis and little time is spent getting data ready to analyze. However, in reality, I’d argue, the time spent is flipped with most time spent on cleaning up data and significantly less time on the analysis. We will just be scratching the surface of the many ways you can work with data in R. We will show the basics of subsetting, merging, modifying, and sumarizing data and our examples will all use Hadley Wickham and Romain Francois’
dplyr package. There are many ways to do this type of work in R, many of which are available from base R, but I heard from many (AED colleagues and Hadley himself!) focusing on one way to do this is best, so
dplyr it is!
Before we jump into the lesson, quick links and lesson goals are:
##Quick Links to Exercises and R code
- Lesson 3 R Code: All the code from this post in an R Script.
- Exercise 1: Subsetting the NLA data with
- Exercise 2: Merge two NLA data files together.
- Exercise 3: Using
dplyr to modify and summarize the NLA.
- Show and tell on using base R for data manipulation
- Better understand data cleaning through use of
merge() to combine data frames by a common key.
- Do some basic reshaping and summarizing data frames.
- Know what pipes are and why you might want to use them
dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package
plyr which has been in use for some time but suffered from being slow in some cases.
dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R. There is a lot of great info on
dplyr. If you have an interest, i’d encourage you to look more. The vignettes are particularly good.
##Subsetting in base R In base R you can use a indexing to select out rows and columns. You will see this quite often in other peoples’ code, so I want to at least show it to you.
You can also index a data frame or select individual columns of a data frame. Since a data frame has two dimensions, you need to specify an index for both the row and the column. You can specify both and get a single value like
data_frame[row,column],specify just the row and the get the whole row back like
data_frame[row,] or get just the column with
data_frame[,column]. These examples show that.
Also remember that data frames have column names. We can use those too. Let’s try it.
##Data Manipulation in
So, base R can do what you need, but it is a bit complicated and the syntax is a bit dense. In
dplyr this can be done with two functions,
filter(). The code can be a bit more verbose, but it allows you to write code that is much more readable. Before we start we need to make sure we’ve got everything installed and loaded. If you do not have R Version 3.0.2 or greater you will have some problems (i.e. no
dplyr for you).
I am going to repeat some of what I showed above on data frames but now with
dplyr. This is what we will be using in the exercises.
But what if I wanted to select and filter? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes in the unix/linux world are not new and allow you to chain commands together where the output of one command is the input to the next. This provides a more natural way to read the commands in that they are executed in the way you conceptualize it and make the interpretation of the code a bit easier. Pipes in R look like
%>% and are made available via th
magrittr package, which is installed as part of
dplyr. Let’s try all three with the same analysis: selecting out a subset of columns but for only a single species.
This exercise is going to focus on using what we just covered on
dplyr to start to clean up the National Lakes Assessment data files. Remember to use the stickies: green when you’re done, red if you have a problem.
- If it isn’t already open, make sure you have the script we created, “nla_analysis.R” opened up.
- Start a new section of code in this script by simply putting in a line or two of comments indicating what it is this set of code does.
- Our goal for this is to create two new data frames that represent a subset of the observations as well as a subset of the data.
- First, from the
nla_sitesdata frame we want a new data frame that has only the following columns: SITE_ID, VISIT_NO, SITE_TYPE, LON_DD, LAT_DD, STATE_NAME, WSA_ECO9, NUT_REG, NUTREG_NAME, LAKE_ORIGIN, and RT_NLA. Name the new data frame
- Next, lets subset the water quality data from
nla_wq. The columns we want for this are: SITE_ID, VISIT_NO, SITE_TYPE, TURB, NTL, PTL, CHLA, and SECMEAN.
- Last thing we are going to need to do is get a subset of the observations from
nla_sites_subset. We need only the lakes with VISIT_NO equal to 1 and SITE_TYPE equal to “PROB_Lake”. Keep the same name,
nla_sites_subset, for these data frames.
Joining data in
dplyr is accomplished via the various
x_join() commands (e.g.,
anti_join, etc). These are very SQL-esque so if you speak SQL (I am far from fluent!) then these will be pretty easy for you. If not then they aren’t immediately intutive. For our purposes, the base functions
merge() are more than adequate.
We are going to talk about several different ways to do this. First, let’s add some new rows to a data frame. This is very handy as you might have data collected and entered at one time, and then additional observations made later that need to be added. So with
rbind() we can stack two data frames with the same columns to store more observations.
Now something to think about. Could you add a vector as a new row? Why/Why not? When/When not?
Let’s go back to the columns now. There are simple ways to add columns of the same length with observations in the same order to a data frame, but it is very common to have to datasets that are in different orders and have differing numbers of rows. What we want to do in that case is going to be more of a database type function and join two tables based on a common column. A common way to do that in base R is with
merge(). So let’s contrive another example by creating a dataset to merge to
rbind_df that we created above.
##Exercise 2 In this exercise we are going to practice merging our NLA data. You will remember that we have two datasets, one on water quality and one on site information. We selected some info out of the water quality data that we didn’t select out of the site info so we have two data frames, with differing numbers of rows and unknown order. Use your stickies!
- This is the only task we have for this exercise. Add to your script a line (or more if you need it) to create a new data frame,
nla_data, that is a merge of
nla_wq_subset, but with all lines in
nla_wq_subsetpreserved in the output. The column to merge on is “SITE_ID”
- This data frame will have some
NAvalues. For the purposes of these lessons, it is better to remove these. Add another line to your code and create a data frame that replaces
nla_datathat has had all of the
- If that goes quickly, feel free to explore
##Modify and Summarize
Now back to
dplyr. One area where it really shines is in modifying and summarizing. We will do more here than we did with base, but first lets walk through one of the examples we did previously, aggregating. We can do this with
First, we’ll look at an example of grouping a data frame and summarizing the data within those groups.
There are many other functions in
dplyr that are useful. Much of what they do, can certainly be accomplished with base R, but not quite as intuitively. Let’s run through some examples with
arrange() will re-order a data frame based on the values of a columns. It will take multiple columns and can be in descending or ascending order. I think
iris is getting a bit tired, let’s try a different stock data frame this time:
mtcars. If you are interested you can try
data() to see what is available.
slice() which accomplishes what we did with the numeric indices before. Remembering back to that, we’d could grab rows of the data frame with something like
mutate() allows us to add new columns based on expressions applied to existing columns
Lastly, one more function,
rowwise() allows us run rowwise, operations. Let’s use a bit of a contrived example for this.
We now have quite a few tools that we can use to clean and manipulate data in R. We have barely touched what both base R and
dplyr are capable of accomplishing, but hopefully you now have some basics to build on. I personally think the database connection in
dplyr are going to prove very useful.
Let’s practice some of these last functions with our NLA data.
##Exercise 3 Add a new section to our script to calculate the nla water quality means.
nla_datathat we created in the previous exercise.
- Add some lines to your script to calculate the mean by LAKE_ORIGIN, for TURB, NTL, PTL,CHLA, and SECMEAN. Save to a data frame named nla_wq_means_orig.
- Repeat the same analysis but for the
WSA_ECO9ecoregions. Save this to a data frame named
- It might be interesting to compare the grouped means to the means of each value for the entire dataset. Using
summarize(), calculate the mean wq for all lakes (hint: no groups!). Save this as