This is the first in a series of posts on the statistics language R.
Do you work with data (doing data processing, analysis, or visualization)? Are you currently using SPSS, Excel, or SAS and you know it sucks but aren’t sure you want to try something new? Have you heard about R and are scared to try it? Have you tried R but are super confused? Do you currently use R but want to know more? If you answered “yes” to any or all of those questions, then this post (and/or one of the following in the series) is for you! R is a free and open source alternative to SPSS, SAS, and other analysis and statistics programs.
What is R?
So what is R? Well, it’s an open source alternative to other statistics programs such as SPSS, Stata, and SAS. It’s an object-oriented language based on the S programming language and it is specifically designed to be used for data analysis, visualization, and other statistics and data-related uses.
Why should you use R?
Or, more precisely, why should you use R instead of another statistics program? Most of us who need something like R were trained using another program. In my case, I learned SPSS in graduate school. And it seemed ok. But there are a lot of reasons why you should switch to R or start learning R instead of one of those other programs.
First of all, R is free. Woo! That’s way better than SPSS or SAS or whatever. But the main reasons why it’s better do not have to do with money. To start with, R has a large and growing user base. R users are also very visible online and are very helpful when problems arise (especially on Stack Overflow and #rstats on Twitter).
Additionally, and perhaps more importantly, because R is open source it will inevitably be part of the larger push for more transparency and reproducibility in publishing and presenting research findings. There is an increasing push for including code (and data) with journal articles, and R may well become the lingua franca for communicating methods and results. Competency in R may be necessary for parsing others’ results and also for sharing your own.
R is also very compatible with version control systems, which will make your whole workflow better and less prone to data loss and make it easier to collaborate with others.
However, R may not be for you. Or, at least not right now. There is a relatively steep learning curve, especially for people not already familiar with object-oriented languages or programming generally. Programs like SPSS have GUIs meant to help newer users be more comfortable. R doesn’t have that, although some options for using R are easier than others (see the section below). I believe that learning and using R is a good exercise and is worth the effort needed to understand it.
R can be downloaded from CRAN. There are a bunch of options for how to use R. The most basic is to use R in the terminal. This used to be the only option, and I think a lot of people might have looked at R a while ago and been confused by it (especially if they were more used to something like SPSS). For a while there was an Excel plug-in you could use, but it’s not great and I think it only makes it harder to use R the “proper” way.
Luckily, now there is RStudio which is an IDE (an integrated development environment) for R. It makes it easier to work with your data files, write code scripts, and flip through the visualizations you have made. It also has integrated help files, code highlighting, and other features that are typically found in IDEs.
To start, you’re probably going to want to load in your data (I’m going to assume you have a csv file). There are two ways to do this in RStudio. One, select Import Dataset in the Environment tab (it’s probably in the upper right corner of your screen). Two, in the console (or in a script) you can use the command read.csv() with your filename, in quotes, in the parentheses. There are other options of course, but those are the easiest ones to get started with.
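Here is a minimal sketch of the read.csv() route. So that it runs anywhere, it first writes a tiny made-up csv to a temporary file; the column names and values (name, score) and the variable names csv_path and scores are just placeholders for illustration, not anything from a real data set. With your own data you would skip that step and pass your own filename.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# (Hypothetical data; in practice, point read.csv() at your own file.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_path)

# read.csv() returns a data frame; by default the first row
# of the file is used for the column names.
scores <- read.csv(csv_path)

str(scores)   # show the column names and types
head(scores)  # show the first few rows
```

Assigning the result to a name (here scores) is what keeps the data around; if you call read.csv() without assigning it, R just prints the contents to the console.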
R comes with a bunch of base functions, but there are also a ton of packages you can download to add more functionality. These other packages include documentation and sometimes sample data sets to help you get familiar with them. Some of my favorite packages are:
As mentioned above, this is the first in a series on R. Later posts will cover data manipulation with the package dplyr, data visualization with the package ggplot2, and creating interactive web-based apps.
The second post is up: on R Basics.