Toward a more open science practice with R

Recently I did a webinar with my colleague Joshua Rosenberg, hosted by the Center for Open Science, on Analyzing Educational Data with Open Science Best Practices, R, and OSF. You can find a recording of the webinar here, and our slides and an example R Notebook are in an OSF repository here. I thought I would write this blog post to summarize the main things I talked about there and highlight the most important aspects.

This webinar was ostensibly about open science for educational data. I think most of us want to engage in more open science practices (which could include open data, open materials, preregistration of studies, replication, posting preprints, and reporting null results) but don’t necessarily know where to begin or what tools to use. We tried to make the argument that workflows, procedures, practices, and behaviors that are good practice for present you, future you, and your internal team are also good for open science, and that R and its many packages and tools are a good way of achieving those goals.

I’ve written many times before about how much I love using R and how I want others to incorporate it more into their practice. I’ve now collected that series of blog posts, along with other related posts (like this one!) and slides, onto one page for easy access. You can go to cynthiadangelo.com/r/ to see all of the R-related stuff that I have worked on linked in one place.

In general, I’ve been thinking a lot lately about my values and commitments as a researcher and how I approach my work in a very basic way. What is important to me, my collaborators, and my field? How could I be doing things differently or looking at my assumptions differently?

This thinking led me to this set of things to consider for a more open science approach:

  • There are a lot of technical tools and solutions to some of the open science problems. But there are also philosophical/ethical/moral issues to consider.
  • Humans are participants who helped produce your data. All humans deserve respect, and so do their data.
  • There’s no easy answer for some of these situations you might face. That’s ok. Part of what open science asks is to consider your options and document your decision making.
  • Reflect early on in your process about what your goals are and how you want to achieve them. What are your values? How do these match up?

Some of the tips and guidelines that I talk about in the webinar:

  1. Documentation is so important. It’s also really difficult. Making things clearer for you and your future self will also make them clearer for others who might eventually see your code.
  2. A tidy data structure will make things easier for you and easier for others to understand. If you’re not already on board the tidyverse train, it’s never too late to start.
  3. Make sure you have a data dictionary somewhere that explains all of your variables. This sounds obvious, but it often doesn’t happen because in the moment you think you understand all of your variables. Future you will not remember all of those things. Write it down, preferably in an R Notebook (more on that later). There’s a small sketch of one way to do this just after this list.
  4. Pick a consistent and clear file naming convention early in your project, preferably before you begin data collection. Think about the date format you use and about the unit of analysis you will care about later, and try to incorporate them directly into your filenames to help with filtering and analysis later on.
  5. Of course I want you to visualize your data. Descriptive statistics can sometimes be misleading; visualization is an important step in your process, not just an end product.
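Here is a minimal sketch of what a data dictionary and a file naming convention could look like in R. The variable names, descriptions, and file names below are made up for illustration, and this assumes the tibble and readr packages are installed.

library(tibble)
library(readr)

# A simple data dictionary: one row per variable, with a plain-language description.
# (Variable names and units here are hypothetical.)
data_dictionary <- tribble(
  ~variable,    ~description,                    ~units,
  "student_id", "Anonymized student identifier", NA,
  "pre_score",  "Score on the pre-assessment",   "points (0-100)",
  "post_score", "Score on the post-assessment",  "points (0-100)"
)
write_csv(data_dictionary, "data_dictionary.csv")

# A file naming convention that sorts well and encodes the unit of analysis,
# e.g., "2018-03-14_classroom-07_observations.csv" (ISO dates sort correctly).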

The thing that ties all of this together is using an R Notebook within RStudio. R Notebooks make use of RMarkdown, a flavor of Markdown, my favorite way to write. It is a plain text file, so it’s easy to version control and easy to share, both things that are hugely important when thinking about open science. I really like R Notebooks because you can easily incorporate explanatory text alongside your code, and figures/graphs are persistent across the page so you can scroll and easily refer back to something above or below where you are working. This, in my opinion, is a much better way to use R than the older way with scripts and the console.

R Notebooks can produce an html file that you can send to your colleague or friend who doesn’t have R installed and they will be able to open it up in a browser and see all of your wonderful thoughts and figures. It’s really great. You can also execute code in Python or JavaScript or D3 (or a few other programming languages) in addition to R, so it’s very versatile. There are a lot of output formats as well, including pdf, Word, slide decks, dashboards, and books. And they are all customizable. Check out the RMarkdown website to see all of the options and more details on how they work. For me, they dramatically changed (in a good way) how I do my work.
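To make that more concrete, here is a minimal sketch of what an R Notebook file (a plain-text .Rmd file) looks like; the title and the code inside the chunk are just placeholders, and it assumes ggplot2 is installed.

---
title: "My analysis notebook"
output: html_notebook
---

Some explanatory text, written in Markdown, sits right next to the code.

```{r}
library(ggplot2)                # the mpg example dataset lives here
ggplot(mpg, aes(displ, hwy)) +  # a quick scatterplot
  geom_point()
```

In RStudio, the Preview button turns this file into a self-contained html file that you can send to anyone with a browser.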

Maybe a good question to leave you with is to try and answer “What is the best way for you to work toward open science?” It doesn’t have to be a big thing; it can be a bunch of small changes over time. This hopefully shouldn’t feel too overwhelming. There are lots of us here to help.

Data Visualization

This is the fourth part in an ongoing series on how and why you should be using R, especially if you are a social science researcher or education researcher, like me. If you missed the earlier ones, you can check out part 1 (Intro to R), part 2 (R Basics), and part 3 (Data Cleaning and Manipulation). This post will go into some more specifics relating to data visualization.

There are many ways to visualize your data using R. By far the most popular (and I think robust and flexible) is using the ggplot2 package. This post will talk a bit about why and how to visualize your data and some tips and basics to using R’s ggplot2 package to help you achieve your visualization goals.

Why visualize?

There are lots of reasons why you might want to visualize your data (or rather, why you should visualize your data). It can be a useful tool at various stages of research, and depending on where you are in your analysis process, different aspects of visualization might be more or less important to focus on. The way I see it, there are three main purposes for data visualization: examining your data, showing your data/findings, and sharing your data/findings.

What question are you trying to answer with your data? How can a visualization help you answer that? Do you have a really complex data set that is too hard to easily capture with a few numbers? Are you interested in variation and distribution rather than just means and medians? Are you exploring different relationships between variables and want to see how they interact?
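As a small taste of where this is going, here is a minimal ggplot2 sketch using the mpg dataset that ships with the package; the particular variables and facets are just for illustration.

library(ggplot2)

# Engine displacement vs. highway mileage, colored by car class,
# with a separate panel for each drive type (front, rear, 4-wheel).
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_wrap(~ drv) +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")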

Continue reading “Data Visualization”

Data Cleaning and Manipulation/Organization

This is the third part in an ongoing series on how and why you should be using R. If you missed the earlier ones, you can check out part 1 (Intro to R) and part 2 (R Basics). This post will go into some more specifics relating to data cleaning, organization, and manipulation.

In my opinion, the dplyr package is a game changer for those trying to learn R. It is what motivated me from just recommending that people use R to basically demanding that my friends and co-workers switch to R. I remember the day that I finally got around to learning how to use the package’s functionality and all of the ways in which it lets you easily and clearly manipulate your data frames. I just kind of stared at my computer screen and imagined how much better my data-life was going to be with these tools. I realized that the hours and hours I used to spend in Excel trying to massage my data into the right form were over. Also, I wouldn’t have to decipher weird R base code anymore when trying to create new variables or filter datasets. The dplyr package and its friends make your code/scripts much easier to read which will help both you and future you in trying to decipher what is going on.
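To give a flavor of what that looks like, here is a small sketch of a typical dplyr pipeline using the mpg dataset from ggplot2; the particular grouping and variables are just for illustration.

library(dplyr)
library(ggplot2)  # only needed here for the mpg example dataset

mpg %>%
  filter(year == 2008) %>%                # keep only the 2008 model year
  mutate(avg_mpg = (cty + hwy) / 2) %>%   # create a new variable
  group_by(class) %>%                     # group by car class
  summarise(mean_mpg = mean(avg_mpg),     # summarize each group
            n_models = n()) %>%
  arrange(desc(mean_mpg))                 # sort, best mileage first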

Continue reading “Data Cleaning and Manipulation/Organization”

10,000 Tweets

I have been on twitter for almost ten years. Twitter has changed a lot in that time and my enthusiasm for the platform has waned a bit over the years, but I still find it to be a compelling communication platform. Initially I used it to share about the more mundane, personal parts of life and my stresses as I finished graduate school. Lately it’s become more professionally-focused (most of the time) and more reflective of the many things that are happening in the world (but with important dog pictures also). I have met lots of people through twitter as well as listened and learned from thousands of people who I would never have met in my day-to-day life. It has helped me gain a wider audience for my academic work and has allowed me to share pictures of my awesome dog with strangers and friends alike.

I just hit 10,000 tweets (if I did this correctly then the tweet linking to this post would be number 10,000). And I thought it would be a good opportunity for me to go back through my twitter archive and get a sense of what all of those tweets were about and how I tweeted. (The analysis that follows is actually only on my first 9,945 tweets because I had to request my tweets a couple of weeks ago and do the actual analysis.) This was also a fun R exercise for me.

Continue reading “10,000 Tweets”

R Basics

This is the second part in an ongoing series I’m doing about why I think R is awesome and why you should be using it. (Check out part one!)

So now that you have downloaded and installed RStudio and have some data you want to play with, what are the next steps? How do you get started really working with your data? In this post I’ll cover an overview of the basics of working with R. Future posts will have more details on some of these topics.

Project spaces and working directories

So RStudio has you create a “project” when you get started. You tell it where you want the project to be and then it creates a file with “.Rproj” at the end. The location where this project resides is also your working directory. This will be relevant when trying to load in data.

You can have more than one project (in different places if you want) and I have found creating multiple projects is mostly helpful for keeping different R projects separate. For instance, I have a main R project called “R Stuff” and then also separate projects for a couple of the bigger research projects that I work on. Things not attached to one of those two bigger research projects go in R Stuff and then I sort them out later and move them if they grow into their own thing.

My suggestion is to create most of your code/scripts/whatever in an R script file (extension .R) instead of just using the console to type in commands when you need them. You can load one of these in the main RStudio panel and type and edit your code there. Once you have some code/commands you like, you don’t need to copy them down into the console; you can just hit command-return (on a Mac; control-return on Windows) or use the “Run” button in the upper right corner of that main window.

This script will allow you to do a few things: first, you can see your whole data manipulation/analysis/graphing workflow all at once; second, you can make changes to one step (e.g., switching the size of your graphed data points) and then re-run the code easily; third, you can write comments.

Now, I am not always the best at writing comments. But I try. And it’s really important. Even if you don’t think anyone else is ever going to see your code, you might need to look at it later. And no matter how smart and clever you think you are (actually, if you’re super clever, this is even more important, because on a future day you may not be having a super clever day), you will probably need to read your code again. You are always, at a minimum, collaborating with yourself. And you deserve to have well-commented and documented code. So do yourself a favor and write some sensible comments.
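For example, even a tiny commented script is much friendlier to come back to later; the file and variable names here are made up for illustration.

# Load the survey responses collected in spring 2018 (hypothetical file).
survey <- read.csv("2018-04-10_survey-responses.csv")

# Drop incomplete responses before doing anything else.
survey_complete <- survey[complete.cases(survey), ]

# Quick sanity check: how many responses are left?
nrow(survey_complete)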

Loading and viewing your data

Ok, so you have a data file and you want to start working with it. You have a few options. Most likely it’s a .csv file; to start, I’m going to assume that it’s in your working directory, so you can use the command

d1 <- read.csv("MyDataFile.csv")

This will create a new dataset called d1 that is made up of what was in your csv file. Alternatively, you can use the “Import Dataset” button in the Environment panel. If your data file is in another location, you will have to enter the correct file path.
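For example, if the file lives somewhere else, you can point read.csv at it directly; the folder names below are just illustrations.

# A file in a "data" subfolder of your working directory
d1 <- read.csv("data/MyDataFile.csv")

# Or a full path (Mac-style path shown here)
d1 <- read.csv("~/Documents/my-project/data/MyDataFile.csv")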

For the rest of the examples here, I’m going to use one of the sample datasets that comes with an R package. The mpg dataset is one of the typical datasets for examples; it comes with the ggplot2 package, so install and load ggplot2 first if you haven’t already. It is a dataset of car models and gas mileage data. Play along at home with the following commands.

To start, load the dataset: data(mpg). This should create an entry in the Data section of the Environment panel on the right. It should tell you the name of the dataframe and that there are 234 observations of 11 variables. Alright, but what if we want to look at the data? If you type head(mpg), the console will output the header of the dataframe: the column names and the first six rows of data.

I prefer using glimpse(mpg), which is actually a command from the dplyr package. (If you haven’t already installed the dplyr package, now is a good time: install.packages("dplyr"). We will be using it a lot in later posts.) Glimpse gives you a more compact view of more of the dataset and also tells you how R is interpreting each variable. For instance, R thinks that manufacturer is a factor (true) and that year is an integer (also true). displ shows up as a dbl, short for “double,” which is just a numerical variable that can hold decimal values. None of the text-based variables showed up as strings, which is good for our purposes with this dataset.

This is fine if you have a relatively small dataset, but it begins to get unwieldy if you have a lot of variables. The summary(mpg) call will give you a different view of your data. For the text-based variables, it gives you a count of them (up to a point) and for the numerical variables, it spits out the minimum, quartiles, mean, and maximum values. Pretty handy for a quick check.
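Putting those quick-look commands together (this assumes ggplot2, for the mpg data, and dplyr, for glimpse(), are installed and loaded):

library(ggplot2)  # provides the mpg example dataset
library(dplyr)    # provides glimpse()

data(mpg)      # load the dataset into your environment
head(mpg)      # column names plus the first six rows
glimpse(mpg)   # compact view: every variable, its type, and its first few values
summary(mpg)   # counts for text variables; min/quartiles/mean/max for numeric ones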

If you want to see the whole dataset (or at least, a lot more of it, depending on how big it is) in a format more closely resembling that which you’re used to in Excel or something, you can use View(mpg). This will pop up a “normal” looking dataset in the main window for you to peruse.

Alright, now that we have looked at our data, let’s talk about variables. To access a specific variable, you will use the dollar sign. So, if you want to look at (or refer to) the model variable in the dataframe, you will call it by mpg$model. This way R knows that you are looking in the dataframe mpg and you want the variable model. You can use this in combination with lots of other things. For instance, if you wanted to find the minimum year of car that is in the dataset, you could use min(mpg$year) and it should output 1999.
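A few quick examples of the dollar sign in action:

mpg$model         # the whole "model" column as a vector
min(mpg$year)     # earliest model year in the dataset: 1999
max(mpg$year)     # latest model year: 2008
table(mpg$class)  # how many cars there are of each class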

If you make some changes to your dataset (e.g., adding a variable, reshaping it, filtering it; all topics for a future post), you can also save your dataset in a recognizable format. So if your new dataframe is called mpg2, you can export a csv of it using write.csv(mpg2, file = "mpg2.csv"). This will put a new csv file named mpg2.csv in your working directory. (Adding row.names = FALSE to that call keeps R from writing the row numbers as an extra first column, which is usually what you want.)

Other things to think about with R

In order to maintain an up-to-date version of R within RStudio, there are three separate things you need to update: RStudio itself (the application), R (the base), and all of your packages.

Updating your packages is easy in RStudio. In the Packages tab in the lower right corner (using the default set-up), there is an “Update” button that will easily show you which packages have updates available and lets you download and install them. Super easy. (Updating RStudio is easy too: look in the Help menu (at least on Macs).)

When you start up RStudio, the console will give you a readout of the current version of R that you are running. As of today, that is version 3.2.2 (“Fire Safety”), but if you have an earlier version of R — as long as it’s not too old — most things should run fine. Updating R is sometimes a pain because you can’t do it directly in RStudio (which I think is confusing to people because you can update your packages easily in RStudio). When you download a new version of R, RStudio will automatically detect that, so that’s not too bad. However, RStudio tries to be helpful and store your downloaded packages in the correct place, but a major version update to R actually creates a new package library location, and you have to migrate all of your packages over to that new place. It’s a bit of a hassle, but there is an easy way around it: after the upgrade, run the following command to update your packages and rebuild any that were built under the old version of R.

# Update installed packages, rebuilding any that were built under an older version of R
update.packages(checkBuilt = TRUE, ask = FALSE, type = "binary")

RStudio also has support for version control. Woo! You can use either git or SVN. I have more experience with SVN, but I am in the midst of switching over to git, so maybe I’ll post about that at a later date. I’m not going to go into all of the details for how to set up and use version control, but I will say that it’s a good idea even if you don’t need it for collaboration or sharing purposes.


Next post: we’ll look at how to organize and manipulate your data using my favorite package dplyr! Check it out here.

Intro to R

This is the first in a series of posts on the statistics language R.

Do you work with data (doing data processing, analysis, or visualization)? Are you currently using SPSS, Excel, or SAS and you know it sucks but aren’t sure you want to try something new? Have you heard about R and are scared to try it? Have you tried R but are super confused? Do you currently use R but want to know more? If you answered “yes” to any or all of those questions, then this post (and/or one of the following in the series) is for you! R is a free and open source alternative to SPSS, SAS, and other analysis and statistics programs.
Continue reading “Intro to R”