Data Cleaning and Manipulation/Organization

This is the third part in an ongoing series on how and why you should be using R. If you missed the earlier ones, you can check out part 1 (Intro to R) and part 2 (R Basics). This post will go into some more specifics relating to data cleaning, organization, and manipulation.

In my opinion, the dplyr package is a game changer for those trying to learn R. It is what motivated me from just recommending that people use R to basically demanding that my friends and co-workers switch to R. I remember the day that I finally got around to learning how to use the package’s functionality and all of the ways in which it lets you easily and clearly manipulate your data frames1. I just kind of stared at my computer screen and imagined how much better my data-life was going to be with these tools. I realized that the hours and hours I used to spend in Excel trying to massage my data into the right form were over2. Also, I wouldn’t have to decipher weird R base code anymore when trying to create new variables or filter datasets. The dplyr package and its friends make your code/scripts much easier to read which will help both you and future you in trying to decipher what is going on.

It was like I was looking at a whole new world ahead of me. I was so happy. I want you to be just as happy. And so here is an overview of some of my favorite and most useful features for doing data manipulation with dplyr. At the end, there are a bunch of resources to go deeper and learn more.

Tidy data

The dplyr package is based on a “tidy” dataset paradigm. It will only work well for you if your data is set up in a certain way. In a tidy dataset, each row contains one observation of your dataset and each column contains one variable. If your data is not already in this format, you should use the tidyr package to get it into shape using the gather and spread functions. (There are some analysis and visualization functions that require your data in a long format, so you might need to transform your dataset after cleaning/manipulating later on in the process. More on that in the next post on data visualization.)

The dplyr and tidyr packages are based around a set of verbs to describe what you are doing to your data. These first two, gather and spread should already give you a sense of which one will turn your wide data into long data frame (gather) and which one will reverse that process (spread).

For instance, you might have some student data where each row is one students’ data from a particular session of an online learning environment but there are multiple variables collected during that session. Having all of a particular student’s data in one row with multiple, repeated columns for different sessions is not very tidy. It would make it difficult to do certain kinds of analyses easily. In this example, the session (not the student) is the base unit of data and therefore each row in your dataset should have a unique session for a specific student in it.

Pipes

The key to writing good clean code with dplyr is using the pipe operator %>%. This will not only make it easier to understand what you are doing while you are working on your data right now, it will also be a lot easier to figure out what you did six months from now when you want to do something similar and need to look it up. Trust me, future you will be happy when it’s easy to understand what you did.

The pipe operator works to help you better see the inputs and outputs of functions and the order of operations that happen as you go through steps in your data processing.

One way to read your code with the pipe operator in place is to think of it as the phrase “and then do …”. If you are used to writing your code without this operator, you might be used to nesting your commands and working from the inside out. This type of nesting usually leads to headaches and/or error-ridden code (or at the very least, hard-to-read code). With the pipe, you will be chaining your commands, which makes it easier to see exactly what command you are doing at each step in your code and ensure that the steps are happening in the correct order and to the correct subsets of your dataset. This makes it a lot easier to deal with subsets of your data, which is a very common occurrence for the data manipulation stage.

Select and Filter

The two most important or useful commands in dplyr, in my opinion, are select and filter. These are the two main verbs for dplyr and the ones that I use constantly. Select operates on columns of data (what should be your variables if your dataset is set up correctly) and filter operates on rows (in other words, the cases in your dataset). That’s all you really need to remember.

Let’s start with this basic example below to see how the %>% pipe operator and the dplyr verbs select and filter work with a real data set. (Throughout this tutorial, I will be using the mpg dataset just like in the last post, so you can follow along at home.) We are going to create a new dataframe that selects a subset of the variables in our original dataframe and filters them for specific values.

mpg_short <- mpg %>%  
  select(model:cyl) %>%  
  filter(year >= 2000)

In this example, we want to create a new dataframe that we are going to call mpg_short. We will start with the dataset mpg and so that goes first, followed by the pipe operator. [Yes, I see that mpg_short is the first thing there – that is what we are telling R we want to save our new dataframe as.] So, in English, we start with mpg and then select some columns. Which columns? Well, select gives you a lot of options for how to select which columns you want. In this example I’ve asked it to select all of the columns between model and cyl. But you can also ask it to select columns that starts_with() or contains() some text (there’s also matches() to use with regular expressions) or just the numerical indices for the columns you want (e.g., select(1:4) for selecting columns 1 through 4). You can also easily use this to exclude certain columns by simply placing a minus sign in front of the column (e.g., mpg %>% select(-cyl) which selects all of the columns except for cyl).

The example then says to filter the rows of data for some value(s) after selecting some columns. The filter also has many options for how to filter your data. These are all based on logical operators and should be familiar to most people. In this example, I asked it to filter rows within the year variable that are greater than or equal to 2000. Important note: you are filtering for certain things, so think about it like you are selecting (see above) for certain rows. The general grammar of this usually requires the variable name to be first, then a logical operator (e.g., != or ==) and then the value to test against. R also lets you do things like filter out N/As – a very common task – and that looks a bit different since the logical operator works slightly differently: mpg %>% filter(!is.na(year)) (this asks to filter when it’s not a N/A within the year variable). Also, any of the Boolean operators work and you can also check to see if values are contained within (%in%) a group. The RStudio data transformation cheatsheet lists all of the options and is a handy reference for this type of work 3.

With select and filter you will be able to achieve many of the things you need to do to create subsets of your data for analysis. But there are lots of other things you can do with dplyr. Lots!

Creating New Variables

The other big task that is part of data manipulation is typically creating new variables. There are so many reasons you might need to do this. The verb here is mutate. Think about it like you are taking some variables that already exist and mutating them into new variables. Mutate creates new variables and adds them onto your dataset, and transmute keeps only the new ones that you have created in the dataset.

There are simple things you can do like create a new variable that does a straightforward calculation on a old variable (e.g., mutate(newvar = oldvar*100)). But dplyr also has lots of handy helpers that can do more fun things. There is cumsum that will create a new variable that is the cumulative sum of another one, ntile that will create n buckets in a new variables (think quartile or quintile), and lead and lag that will produce a variable with values shifted or lagged by 1.

In R you will use the ifelse() function to write if/then statements and make changes that way. There are so many uses for if/then statements when creating variables. I used to do this a lot in Excel, and it works very similarly in R. This would be a way to recode a variable (maybe into a binary or other number of buckets that might be more useful for analysis) or to replace values with something else. For example, mutate(colname = ifelse(is.na(colname),0,colname)) will look at the values within a variable named colname and if the value is N/A, it will replace it with a 0, otherwise it will just keep the original value.

Another thing you might need to do is find some information within a string/character variable and then create a new variable based on that information. The example below uses ifelse along with grepl, one of the regular expression functions, to search for the title “Hamilton” within a variable called musicalTitle in a dataset that I made up in my brain. This code creates a new variable that has two values: either “YayHamlet”4 or “Nope” based on whether or not it finds “Hamilton”.

mutate(newvar = ifelse(grepl('Hamilton', musicalTitle), "YayHamlet", "Nope"))

Changing Variable Type

It’s usually best to import your data so that the variables have the correct type (numeric, character, factor, time, etc.). [One way to ensure this is to use an importing function like read_csv() instead of read.csv() that will let you tell R exactly how to interpret each variable type. If you have a large number of variables this may not be a very practical option though.] There are a bunch of helper functions that can coerce your data into different forms. For instance, something might have gotten loaded as a factor (maybe you had set the option StringsAsFactors = TRUE when importing because most of your strings are indeed factors) but you need to change it to a character variable. The as.character() function is an easy way to do this.

Switching between the different variable types isn’t always obvious or straightfoward though, and you will likely run into some trouble with times and dates 5. I could probably write another whole post about dealing with times and dates in R and maybe I will. (Let me know in the comments if you want to see this post.)

You can change from one variable type to another using the helper functions (like as.numeric() and as.factor() and generally that will work out well for you. One of the transformations that is a bit tricky is going from factor to numeric. You’ll need to let R know about the factor levels in your variable so the code looks like this: as.numeric(levels(f))[f].

Grouping and Summarizing

For grouping and summarizing, we are moving a bit into data analysis, although grouping is sometimes really useful for some manipulation techniques. The main verbs here are group_by() and summarise()6. The group_by() function is really great and basically creates a copy of the data in subsets by a grouping variable. This then lets you do all sorts of things, usually applying some kind of summarizing function.

mtcars %>%
group_by(cyl) %>%
summarize(avg = mean(mpg))

This example (also in the RStudio cheatsheet) is taking the mtcars dataset, and then grouping the data by the cylinder count variable in the dataset (note: group_by is really meant for factor-type variables, so make sure you are using it on the right kind of variable) and then creating a new variable called avg that is the mean of the mpg variable for each group of cylinders. This is similar to what the mutate() function is doing, however the resulting dataframe is different. Mutate results in a dataframe of the same row length as you started and summarize ends with a dataframe with only one row per value of the grouped variable. Depending on what you want to do next (and what you’re doing to your data), one of these approaches might make more sense than the other.

Miscellaneous Data Manipulation

Dealing with NAs. NAs come up a lot in the datasets I’m used to and R has some good built-in ways to deal with them. One thing I would caution though is to always check your datasets after using the NA-related functions because they may not perform the way you were expecting (or, you might end up removing some data you didn’t want to). Use na.omit() around your data argument for plotting or piping or na.rm=TRUE when importing (“na.rm” means remove the NAs, if that helps you remember).

Easy filtering with a list
Sometimes you might not want to filter for a single value or a simple range of values like in the example we had earlier in this post. Or maybe you want to filter for non-numeric things. Then, you might want filter based on a list. You can create a vector of names or something else important and then use that to filter a variable. So let’s say I have a list of student names that were in a certain group students_group3 and I want to create a new variable group3 that says whether or not a given student in my dataset was in that group. There’s a logical operator %in% that you can use to check whether a variable’s value matches one in a given list or vector.

Extracting information from data to make changes.
To extract the last [2] characters in a string (or any set number of characters in a string): substr(mytext, nchar(mytext) - 2, nchar(mytext)) where mytext is the variable that contains the text you want to search and nchar() and substr() are functions to deal with strings. If you want to do other cool things with strings, check out the stringr package (big surprise: it’s part of the tidyverse set of packages). It can count the characters in a string str_length() or locate the position of something in a string str_locate() among many other features.

NB: n() %>% distinct() is NOT the same as n_distinct() – these do completely different things, so be careful and thoughtful when using these functions. I’ll leave it as an exercise to you to figure out how these are different.

Joining

Finally, I’ll mention joins briefly. These are very helpful formulae, but are a bit confusing at first (unless you’re really familiar with SQL and database-type queries). There are four types of joins: left, right, inner, and full. (Well, there are also semi and anti joins, but I don’t use those as much since they’re mostly for filtering and I like to filter using filter().) Because of the way I typically structure my datasets and manipulations, I mostly use left_join(). This is because I will sometimes create a subset of data in order to create some summary variables and then rejoin those with the original data set to get the summary variable attached properly to all of the original records or I’m trying to merge in some information from another table into my main dataset that I’m working with.

The left_join() will combine two data sets using a variable or set of variables to match on. So, you might want to match up two data frames based on a student ID number. For those who are well-versed in Excel, this is basically the way to do a vlookup.

Other Resources

I found this YouTube video from Data School on dplyr to be very helpful when I was first starting to figure out what was possible with dplyr. If you are intrigued by this post on dplyr I would highly recommend setting aside forty minutes of your day and watching that video. It goes into more detail about this topic and shows you lots of uses; it’s a great overview.

RStudio data wrangling cheatsheet. I use this (and the other cheatsheets) all the time. It is a great reference and you should keep it handy whenever you’re working with your data. It’s basically just always open in my Preview (pdf application) so I can switch over to it whenever I need to.

R For Data Science by Hadley Wickham. Hadley is one of the main forces behind dplyr and the other packages in the “tidyverse” as it is called. This is his newly released book. There is a free version on the web, but you can get a nice offline paper copy if you want. This covers way more than just the topic overview I have here, but it’s a great resource for anyone wanting to get serious with R.

The next installment in this series will be on data visualization using ggplot2. Let me know in the comments if you have any questions or suggestions.


  1. I use the word “manipulate” and “manipulation” throughout. Just to be clear, for non-data science people, this is sort of a generic term for data organization/cleaning/massaging/munging/whatever you want to call it. It does not mean that I am changing the data in any way that disrupts the integrity of the data set. Raw data (especially raw data coming out of online learning environments and lots of other log data) are almost never useful for visualization or analysis in their original form and so data scientists/quantitative researchers need to clean, process, reform, and filter the data set in order to make it useful. 
  2. In addition to actually being clearer and easier than manipulating data in Excel, there are lots of other advantages (some of which I wrote about in the prior posts on R). The main ones that I see are that if for some reason you get an updated (or second, similar) dataset, it is nearly trivial to run the R script you create on that new one and get the same output. If you have to re-do everything in Excel, it can take almost as long as it did the first time around. So there’s no efficiency in that approach and it’s more likely to add in errors. The other thing is that it’s much much easier to see what you are doing and to change the order of steps which is nearly impossible to do in Excel if you have a complex operation going on. In Excel you would likely have a bunch of nested if statements or vlookups that are all pointing to cells and you have to individually click into each set of cells to see and make sure you know what is there. R, and dplyr specifically, helps make your code more visible and clear. 
  3. Note: RStudio updated the data manipulation/transformation cheatsheet in January 2017 to include the some updates to the dplyr package as well as including other “tidyverse” packages that are helpful. So if you downloaded this a while ago, you might want to check out the new version. It’s very nice and might reveal some new features that you weren’t aware of. 
  4. To understand this joke, please see this amazing tweet from Lin-Manuel Miranda
  5. To deal with times and dates I recommend using the lubridate package. Times and dates are tricky and this package makes them slightly better. (Still better than the garbage way Excel deals with times and dates.) 
  6. Yes, I know that in America we typically spell this “summarize”. Both “summarise” and “summarize” will work in R, but for some reason I usually use “summarise”. I’m guessing it’s because most of the documentation uses that spelling and I’ve gotten used to it. ¯_(ツ)_/¯ 

Author: cynthiadangelo

I am a researcher, working on educational games, science education, and data visualization. I like photography, soccer, traveling, and teaching my dog new tricks.

One thought on “Data Cleaning and Manipulation/Organization”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s