This is the fourth part in an ongoing series on how and why you should be using R, especially if you are a social science researcher or education researcher, like me. If you missed the earlier ones, you can check out part 1 (Intro to R), part 2 (R Basics), and part 3 (Data Cleaning and Manipulation). This post will go into some more specifics relating to data visualization.
There are many ways to visualize your data using R. By far the most popular (and, I think, the most robust and flexible) is the
ggplot2 package. This post will talk a bit about why and how to visualize your data and some tips and basics to using R’s
ggplot2 package to help you achieve your visualization goals.
There are lots of reasons why you might want to visualize your data (or rather, why you should visualize your data). It can be a useful tool at various stages of research, and depending on where you are in your analysis process, different aspects of visualization might be more or less important to focus on. The way I see it, there are three main purposes for data visualization: examining your data, showing your data/findings, and sharing your data/findings.
What question are you trying to answer with your data? How can a visualization help you answer that? Do you have a really complex data set that is too hard to easily capture with a few numbers? Are you interested in variation and distribution rather than just means and medians? Are you exploring different relationships between variables and want to see how they interact?
My process for creating a visualization: (1) get to know your data and have a good sense about what data is available, what form/structure it is currently in, and what kinds of visualizations might be possible or desired; (2) make a drawing of what you think you want to create (you can draw it in your head sometimes too); (3) get your data into the right format to make that visualization; (4) use
ggplot2 to make a basic version of your visualization; (5) refine and tweak the visualization based on feedback and additional needs.
So, what do these five steps look like in more detail?
- It’s good to know about different types of visualizations that are available (e.g., scatter plots, box plots, heat maps) and which types are appropriate for the type of data you have (continuous, categorical, time-series, etc.). Having a sense of the available options for visualization will make it easier to brainstorm in the next step, so it’s a good idea to try to mentally catalog different visualizations that you encounter in your daily life (or from reading other research articles or data journalism). Some good options for collecting example visualizations are: Flowing Data, R graph catalog, Dear Data project, and the Data is Beautiful sub-reddit.
- Be creative! Some things that help with this stage: whiteboards, graph paper, multi-colored pens and notebooks, colored pencils, etc. I like having different colored pens/pencils available because it helps to think about the multiple variables that I can show on one graph or how I want to show groups or trends. It also helps to know what your data looks like so when you sketch out some example graphs or figures you can include dummy data that is not too far away from reality. (This is kind of like writing pseudo-code but for graphs.) This can give you a sense of whether or not the graph you imagine in your head is something that you can create with your data in its current state.
- I wrote a whole blog post on this! Check out my post on data manipulation/merging using dplyr.
- ggplot2 is my preferred visualization package, but there are lots of other good ones (for Python and Matlab as well as the default base R plotting functions). (See below for the basics on how ggplot works. It is different than other plotting tools you might have used.) Although default plotting settings are usually not great for your particular visualization (and usually not for a “finished” visualization), the main key here in this stage of the process is to get something to start with and look at (and maybe show to someone else) and then make edits and changes from there. Think of it as the sacrificial first draft, like you would with some writing. Try out some different things. Look at the histogram of each of your (numerical) variables. Do you have a grouping variable? Color your data with the group info. Or, maybe you want to create multiple smaller graphs to show the different categories or groups in your data. Try lots of options.
- Feedback is so important. If your process up until now has been collaborative, then this may not be as important of a step since you should have been getting feedback the whole time. But many times, this stage is the main point at which you share your viz with others. The feedback could be from your collaborators or friendly colleagues (and yourself, if possible). Sometimes it’s a good idea to make a couple different versions of a visualization so that others have a choice to make about which is better – this mostly helps to start a conversation about why one viz might be preferable to another, which can help generate ideas about other ways to improve it. Likely one visualization is highlighting a specific aspect of your data or relationships and so one might be preferable to another. You also might have inadvertently used colors that make it hard for someone with colorblindness to disambiguate the groups (more on that below). This step might include transferring the visualization to a Shiny app for more interactivity or using some of the capability of various htmlwidgets packages. The end of step 5 might be to present your visualization at a conference or add it into a paper. It might also just be something that helps you contextualize and understand your data better so that you’re able to do other analyses.
How does ggplot work?
The “gg” in
ggplot stands for the “grammar of graphics”. It is based on the idea that a graph is created using the components of data, a coordinate system, and a way of representing data points with a particular type of mark (called “geoms” here). Every
ggplot call has the same basic structure and it is up to you to provide it with the right components to make the graph you want. It works as layers, so sometimes it’s helpful to think about providing a basic structure for your graph and then layering on additional information (like shapes and colors).
Let’s start with making a simple scatter plot. We will use the
mpg dataset. To make a scatter plot with displacement and highway mpg data:
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(). The first part of the call is telling
ggplot which dataset to use. The
aes() is very important. This is the aesthetics mapping and it tells R what variables should be used (and on which axis). (If you are making a plot, like a histogram, that only has one variable, then you can just use x.) That first part has created the basic structure of the plot, the data and the coordinate system. The next part, after the plus sign, tells R how to show you the data. There are many different geom options (see the cheatsheet – linked at the bottom of this post – for all of them). But the ones you’ll use most at the beginning will likely be
geom_point() for a scatter plot,
geom_histogram() for a histogram,
geom_col() for a bar/column graph, and
geom_boxplot() for boxplots.
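Here is a minimal sketch of those other starter geoms, using the same mpg dataset that ships with ggplot2. The object names (p_hist, class_counts, and so on) and the binwidth are just my choices for illustration, not anything required by the package.

```r
library(ggplot2)

# One numeric variable: a histogram only needs x.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# geom_col() expects one row per bar, so count the cars in each class first.
class_counts <- as.data.frame(table(class = mpg$class))
p_col <- ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_col()

# A categorical x and a numeric y give one box per group.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Printing a plot object is what actually draws it.
print(p_hist)
print(p_col)
print(p_box)
```

Notice that only the geom (and, for the column chart, the shape of the data) changes; the ggplot() plus aes() skeleton stays the same each time.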
Ok now that you’ve made a very basic plot, we can add in some elements and options to further show off your data. There are a few that I use regularly and are pretty common ways to change the look and feel of your data: color, fill, shape, size, and alpha. With each of these options, it is important to remember that they can be used in two different ways. So you need to think about what kind of variable you want to change (i.e., is it numeric or categorical) and how you want to change it.
Let’s start with color. The two options are to change all of your data points to a new thing (e.g., make all of these points blue) or to map your data points onto a color set in some way. The easiest thing is to just change all of your points to a new color. You can do this by just adding
color = "blue" to your geom:
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue").1 But you can also use the aesthetic mapping option to map a color onto a variable. So let’s say that we want that same graph but we want to color the data points by the class of car. We will want to put
color=class into an
aes() wrapper, just like we did for the x and y information. It will look like this:
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = class)). Note two things here: 1)
class does not have quotation marks around it like
blue did in the prior example. Class is a variable in the dataset and R knows that and is expecting it to be called directly. Blue is a reference to a color that is part of a list of character strings.2 Also 2) the call to
aes() can go either here in the geom function or earlier in the ggplot function. However, keep in mind that anything you put into the initial ggplot function will apply to all of the layers that are added on afterward.
In this example,
class was a categorical variable and so R knew to map it in that way. If we used
cty instead, it would have mapped the data differently. Try that out and see how different it looks.
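If you want to see the difference side by side, something like the following sketch should do it. The third plot, which wraps a number in factor(), is my own addition and not something covered above; it is a common way to force a numeric variable to be treated as categories.

```r
library(ggplot2)

# Categorical variable: each class gets its own discrete color.
p_discrete <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class))

# Numeric variable: cty is mapped onto a continuous color gradient.
p_continuous <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = cty))

# Wrapping a number in factor() makes ggplot treat it as categories.
p_forced <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = factor(cyl)))

print(p_discrete)
print(p_continuous)
print(p_forced)
```

The legend is the quickest tell: separate keys for a categorical mapping, a color bar for a continuous one.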
The same basic principle applies to the other aesthetic options. Think about whether or not you want to map your variable onto a shape/size or just change all of the data points to a certain shape/size. Fill works very similarly to color, but depending on what kind of geom you are using, this might be the thing you want to change. If you’re graphing something and it doesn’t seem to do what you think, try
fill instead of
color and see if that helps. There is a list of all 26 point shapes (numbered 0 to 25) and their id numbers on the RStudio Cheatsheet. You can also provide a single character to the argument, e.g.,
shape = "d" to make all of your points look like the letter “d”.3
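A rough sketch of the set-versus-map distinction for shape, plus the fill/color difference on a boxplot. The particular shape number, size, and colors here are arbitrary picks of mine.

```r
library(ggplot2)

# Setting: every point becomes an open circle (shape 1).
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(shape = 1, size = 2)

# Mapping: the shape depends on the drive train (drv has three values).
p_map <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(shape = drv), size = 2)

# For boxes and bars, fill colors the inside and color only the outline.
p_fill <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(fill = "lightblue", color = "black")

print(p_set)
print(p_map)
print(p_fill)
```

Shape works best for a variable with only a handful of groups; with many categories the symbols get hard to tell apart.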
Alpha is kind of a special one that I think is underutilized. I use it all the time, especially when I’m at the early stages and trying to get a good sense of my data. I sometimes have a large amount of data, with many overlapping points, and changing the alpha level can help me see the overlap.4 Alpha is basically the transparency of the data point. An alpha = 1 would be fully solid and alpha = 0 would be fully transparent. I usually try something around 0.4 and then adjust up or down based on the data set. Typically I do not want to map the alpha onto a variable, so it could look something like this:
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = cty), alpha = 0.4).
Another element that I use frequently is faceting. Facets will create a bunch of smaller graphs, separated out by one or more variables. There are two facet functions and they work similarly:
facet_grid makes a grid (surprise) based on one or two variables and
facet_wrap will create a wrapped kind of list of graphs where you can specify the number of rows or columns that are created. So, let’s try to
facet_wrap on the
class variable with three rows of graphs:
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class, nrow = 3). Try it with different numbers of rows (using
nrow) or with specifying the number of columns instead (with
ncol). If you’re interested in how two variables are related, you’ll want to use
facet_grid. Here’s an example:
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(drv ~ cyl). With both of these functions there are a lot of options in how to improve and customize them. One of the ones you might want to use is
scales which can change whether or not all of the mini graphs have the same or different scales. The
scales argument takes the options "fixed" (the default) or "free". Fixed means that the scales will be the same across graphs and free means they will vary with the data in that graph. Fixed scales make it easier to see patterns across panels; free scales make it easier to see patterns within panels.
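To compare the two for yourself, here is a small sketch that reuses one base plot and only changes the scales argument. The "free_y" variant at the end is an extra option I'm including for illustration; it frees just one axis.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Fixed (the default): every panel shares the same x and y ranges.
p_fixed <- base_plot + facet_wrap(~class, nrow = 3, scales = "fixed")

# Free: each panel's axes are fitted to its own data.
p_free <- base_plot + facet_wrap(~class, nrow = 3, scales = "free")

# Free only the y axis, keeping x comparable across panels.
p_free_y <- base_plot + facet_wrap(~class, nrow = 3, scales = "free_y")

print(p_fixed)
print(p_free)
print(p_free_y)
```

Saving the shared part of the plot in an object like base_plot is also just a handy habit when you're trying out lots of versions.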
Formatting your data for use with ggplot2
There may be confusion about what really counts as tidy data in some cases and whether or not this automatically means that you have a “long” dataset. For instance, you might have some student data where each row is one student’s data from a particular session of an online learning environment but there are multiple variables collected during that session. This is a tidy, but still technically wide dataset that is perfectly fine to use for data organization.
There may be some good reasons to keep your data tidy, but relatively wide for a while and then turn it into a tidy long dataset before plotting with
ggplot. I do this sometimes based on how my variables are set up and what I am planning to do with my data. Some data manipulation is easier to do in the wide form. For instance, if you want to subset based on a group of variables, you can easily do this if those variables all have something in common in their names (e.g., select all variables that start with “quiz”). A tidy dataset is easy to transform into a long one for certain types of
ggplot figures later on, so don’t worry too much about that now.
Getting your data in the right format can be tricky sometimes, so be patient with yourself and your data.
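As a sketch of what the wide-to-long step can look like, here is a made-up example. The quiz_wide data frame and all of its values are hypothetical, and I'm assuming the tidyr package for the reshaping (its pivot_longer() function), which is one of several ways to do this.

```r
library(ggplot2)
library(tidyr)

# Hypothetical data: one row per student, one column per quiz (tidy but wide).
quiz_wide <- data.frame(
  student = c("s1", "s2", "s3", "s4"),
  quiz1 = c(7, 9, 6, 8),
  quiz2 = c(8, 9, 5, 7),
  quiz3 = c(9, 10, 7, 8)
)

# Long format: one row per student per quiz.
quiz_long <- pivot_longer(
  quiz_wide,
  cols = starts_with("quiz"),
  names_to = "quiz",
  values_to = "score"
)

# With the long data, quiz can go on an axis and student can be a color.
p_quiz <- ggplot(quiz_long, aes(x = quiz, y = score, group = student, color = student)) +
  geom_line() +
  geom_point()

print(p_quiz)
```

The long version is what lets a single aes() call put "which quiz" on the x-axis; in the wide version there is no one column to point it at.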
Tips and tricks for ggplot
Now you might remember that I mentioned earlier that ggplot is great because it thinks in layers when building a visualization. So, you can add on lots of things beyond the basic plot elements we have already built. A lot of the time this will be adding on things like labels and theme elements to change the look of the visualization. For instance,
+ labs(title = "Here's my cool title", x="This is my x-axis label") would add a title and a new x-axis label to your graph. Another thing you can add is a statistical function. There are many of these and they (and their many options) are listed on the Cheatsheet (see Resources section below). One of the commonly used ones is
stat_smooth(), which adds a smoothed trend line (the default is a loess curve for smaller datasets, but you can ask for a linear model with method = "lm" or specify whatever method and formula you’d like) with an optional confidence interval (which is on by default). It looks cool and can be an easy way to display a trend. This statistical function would be added after your base ggplot call and after your regular geom call as well. You can just keep adding things on if you want!5
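Putting those pieces together, here is a sketch that layers points, a trend line, and labels. I'm explicitly asking for a straight-line fit with method = "lm" so it is clear which smoother is being drawn; the label text is just placeholder wording of mine.

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.4) +
  # method = "lm" requests a linear fit; se = TRUE draws the confidence band.
  stat_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(
    title = "Highway mileage by engine displacement",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

print(p_smooth)
```

Because the smoother is added after geom_point(), the line and its band are drawn on top of the points.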
Color is an important element of visualizations, especially if you’re going to be presenting your work to others. The default color options are not always the best choice for what you need and there are other packages and alternate color palettes available that may work better for you. One that is now built in to ggplot2 is RColorBrewer, which can help make your plots the prettiest they can be (or, you know, just like really readable and understandable). I really like the RColorBrewer palettes and now they are easy to add on using scale_color_brewer() or scale_fill_brewer(), depending on what element you’re trying to change. Within the function you can just tell it
palette="Blues" or whichever one you like. You can see the whole list with
RColorBrewer::display.brewer.all(). I also really like the
viridis package if you want to be more supportive of color-blind people (which, it turns out, also makes it easier for everyone to see distinctions in the colors and, therefore, your data).
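Here is a short sketch of swapping palettes. The "Set2" palette is just one I picked for a categorical variable, and for viridis I'm using the scale functions that come with ggplot2 itself (scale_color_viridis_d() and scale_color_viridis_c()) rather than loading the separate viridis package, which is an alternative route to the same colors.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy))

# A qualitative ColorBrewer palette for a categorical variable.
p_brewer <- base_plot +
  geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Set2")

# Viridis for a discrete variable (_d) ...
p_viridis_d <- base_plot +
  geom_point(aes(color = class)) +
  scale_color_viridis_d()

# ... and for a continuous variable (_c).
p_viridis_c <- base_plot +
  geom_point(aes(color = cty)) +
  scale_color_viridis_c()

print(p_brewer)
print(p_viridis_d)
print(p_viridis_c)
```

If you are changing the fill of bars or boxes instead of point colors, the matching functions start with scale_fill_ instead of scale_color_.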
There are also SO MANY additional packages that add more functionality and options to
ggplot. I am probably only aware of a small number of them, and more seem to become available every week. A lot of these are extensions built by R community members. You can see a pretty good list of vetted ones at the ggplot extensions website. The extensions are usually adding a new geom or a new stat to your options, although some of them are themes or other ways to improve the look and feel of your visualization.
Adding a new theme is another easy way to change the look of your visualization. There are some built-in themes to
ggplot. I tend to like
theme_bw() from the options. There is also a package called
ggthemes that adds even more theme options for you. Some of these will try to match the style guide for various journals or a Tufte style plot. Or like, you can match the (ugly, in my opinion) color choices of a default Excel graph. Whatever your heart desires. The options are nearly unlimited. See all your choices listed here. (As a side note, you don’t need to use all of the elements of a theme if you don’t want to. Call the theme first, usually at the end of your other elements, and then make manual changes to the theme after that.)
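As a sketch of that "theme first, manual changes after" order, here is theme_bw() followed by a few tweaks. The specific tweaks are arbitrary examples of mine.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(title = "A themed scatter plot") +
  # Call the complete theme first ...
  theme_bw() +
  # ... then override individual pieces of it.
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

print(p_theme)
```

The order matters here: if theme_bw() came after the theme() call, it would reset those manual changes.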
Lastly, I will discuss adding elements of interactivity or movement to your visualization. There are many different ways to do this, depending on what kind of interactivity you want and how much technical expertise you have (and/or want to deal with). By far the most basic and easiest ways are to take advantage of the htmlwidgets packages:
plotly for basic interactivity,
leaflet for map-related interactivity, and
DT for all things data table related. You can also animate your graph with the
gganimate package. This is a less-straightforward option and requires downloading some additional software helpers on your computer. Check out the gganimate website for all of the info on how to install it and use it to make fun gifs of your graph. Who doesn’t like animated gifs?
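For the plotly route, a minimal sketch looks like this, assuming you have the plotly package installed. Its ggplotly() function takes a ggplot object you have already built and converts it into an interactive version with hover tooltips and zooming.

```r
library(ggplot2)
library(plotly)

# Build the static plot as usual.
p <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

# Convert it to an interactive htmlwidget.
p_interactive <- ggplotly(p)

# In RStudio this opens in the Viewer pane; in R Markdown it embeds in HTML output.
p_interactive
```

Not every ggplot feature converts perfectly, so it's worth checking the interactive version against the static one.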
The shiny package is another way to add interactivity to your graphs, but it is by no means an easy way to do so. The next and last post in this blog series will be about the basics of Shiny and how to make web-based interactive graphics and dashboards. If you can’t wait until then, you should start with the tutorials on the Shiny website.
And lastly, like with all of my R posts, my biggest suggestion is to try things out for yourself with your own data and see how it works. And, of course, look at the cheat sheets provided by RStudio. I use them ALL the time and they are very helpful, even if you use R every day. There are too many options and nuances to have memorized them all. Use the cheatsheet.
Also, bonus if you are a University of Illinois at Urbana-Champaign graduate student: most fall semesters I teach a class on data visualization that goes into way more detail on all of these points (and many others). It’s EPSY 508 if you’re interested.
If you have questions or other tips/tricks that you have found helpful, leave them below in the comments. Thanks!
- Like other functions in R, the American or British spelling of color/colour will work just fine. ↩
- Fun fact: there are a LOT of color options built into R. You can see the whole list here: R colors ↩
- Unfortunately, R doesn’t have the full unicode library so you can’t plot emojis this way. (Yet.) There are some additional packages you can get that will let you use emojis, just not as easily as this. ↩
- Another way to see if you have lots of overlapping data is to use
- But, remember, the order of the layers DOES matter. You should think about it literally as if you are putting more layers on top when you add things. So if you want to create something where you have a lot of filled-in gray circles as points (shape = 16) with an open circle (shape = 1) in colors based on some grouping variable, you need to put them in the right order so you can see the open circle on top. ↩