Multi-modal Learning Data Collection at (Small) Scale

subtitle: even the best-laid plans…

Last year (spring 2015) we collected a really nice set of data of students collaborating in groups of three. The data collection process wasn’t entirely smooth or perfect, but it generally went off without any major technical or logistical problems. We ended up with a dataset of almost 150 students with high-quality audio data (four channels per group), video recordings (one per group), and computer log files (ideally one per group, practically more than one). [NB: The annotated audio from this first phase of data collection will be made available soon to other researchers. You can read the paper about the data set (presented at Interspeech 2016) here.]

In the spring of 2016 we set off to do our second phase of data collection, in classrooms during a regular class period. So unlike the first phase, where we had just two groups at a time with kids who had volunteered and were excited to try out some math problems (a.k.a. the best kids), we had up to 10 groups at once with varying levels of excitement and/or willingness to follow directions. We mostly wanted to test how well the audio capture worked with all of the background noise in a typical classroom environment and see if our speech models still held up.


R Basics

This is the second part in an ongoing series I’m doing about why I think R is awesome and why you should be using it. (Check out part one!)

So now that you have downloaded and installed RStudio and have some data you want to play with, what are the next steps? How do you get started really working with your data? In this post I’ll cover an overview of the basics of working with R. Future posts will have more details on some of these topics.

Project spaces and working directories

So RStudio has you create a “project” when you get started. You tell it where you want the project to be and then it creates a file with “.Rproj” at the end. The location where this project resides is also your working directory. This will be relevant when trying to load in data.
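If you’re ever unsure where your working directory actually is, you can ask R directly. A tiny sketch using only base R functions (the variable names are just placeholders):

```r
# Print the current working directory (for an RStudio project,
# this is normally the folder that contains the .Rproj file)
wd <- getwd()
print(wd)

# List the files R can see there without needing a longer file path
files_here <- list.files(wd)
print(files_here)
```

Any data file that shows up in that list can be loaded by its filename alone.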

You can have more than one project (in different places if you want) and I have found creating multiple projects is mostly helpful for keeping different R projects separate. For instance, I have a main R project called “R Stuff” and then also separate projects for a couple of the bigger research projects that I work on. Things not attached to one of those two bigger research projects go in R Stuff and then I sort them out later and move them if they grow into their own thing.

My suggestion is to create most of your code/scripts/whatever in an R script file (extension .R) instead of just using the console to type in commands when you need them. You can load one of these in the main RStudio panel and type and edit your code there. Once you have some code/commands you like, you don’t need to copy them down into the console; you can just hit command-return (on a Mac, probably control-return on Windows) or use the “Run” command in the upper right corner of that main window.

This script will allow you to do a couple of things: first, you can see your whole data manipulation/analysis/graphing workflow all at once; second, you can make changes to one step (e.g., switching the size of your graphed data points) and then re-run the code easily; third, you can write comments.

Now, I am not always the best at writing comments. But I try. And it’s really important. Even if you don’t think anyone else is ever going to see your code, you might need to look at it later. And no matter how smart and clever you think you are (well, actually I think if you’re super clever then this is going to be more important because on a future day you may not be having a super clever day), you will probably need to read your code again. You are always, at a minimum, collaborating with yourself. And you deserve to have well-commented and documented code. So do yourself a favor and write some sensible comments.
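To make that concrete, here is a rough sketch of what a small commented script might look like. The script name, the point_size variable, and the choice of the built-in cars dataset are all stand-ins I picked for illustration, not anything from a real project.

```r
# braking_distance.R (hypothetical script name)
# Goal: quick look at speed vs. stopping distance in the built-in `cars` data

# Step 1: load the data (it ships with R's datasets package)
data(cars)

# Step 2: a quick numerical summary of one variable
speed_summary <- summary(cars$speed)
print(speed_summary)

# Step 3: plot it; change point_size and re-run this step to resize the points
point_size <- 1.5
plot(cars$speed, cars$dist,
     cex = point_size,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")
```

The whole workflow is visible in one place, each step can be re-run on its own, and the comments tell future you what each step was for.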

Loading and viewing your data

Ok, so you have a data file and you want to start working with it. You have a few options. Most likely, it’s a .csv file and I’m going to assume to start that it’s in your working directory so you can use the command

d1 <- read.csv("MyDataFile.csv")

This will create a new dataset called d1 that is made up of what was in your csv file. You can also use the “Import Dataset” button in the Environment panel. If your data file is in another location, you will have to enter the correct file path.
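For the “another location” case, here is a minimal sketch. It makes a tiny csv in a temporary folder (standing in for wherever your file really lives) and reads it back using the full path; the folder and the data values are made up for the example.

```r
# Make a small example csv in a temporary folder
# (a stand-in for "another location" on your computer)
data_dir <- tempdir()
csv_path <- file.path(data_dir, "MyDataFile.csv")
write.csv(data.frame(id = 1:3, score = c(10, 12, 9)),
          csv_path, row.names = FALSE)

# Read it back in using the full path instead of just the filename
d1 <- read.csv(csv_path)
print(d1)
```

Building the path with file.path() saves you from worrying about which kind of slash your operating system wants.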

For the rest of the examples here, I’m going to use one of the sample data sets that comes with some R packages. The mpg dataset is one of the typical datasets for examples, and it comes with the ggplot2 package (so you will need to have ggplot2 installed). It is a dataset of car models and gas mileage data. Play along at home with the following commands.

To start, load the package and then the dataset: library(ggplot2) followed by data(mpg). This should create an entry in the Data section of the Environment panel on the right. It should tell you the name of the dataframe and that there are 234 observations of 11 variables. Alright, but what if we want to look at the data? If you type head(mpg) the console will output the header of the dataframe: the column names and the first six rows of data.

I prefer using glimpse(mpg) which is actually a command from the dplyr package. (If you haven’t already downloaded the dplyr package, now is a good time. We will be using it a lot in later posts.) Glimpse gives you a more compact view of more of the dataset and also tells you how R is interpreting each variable. For instance, R thinks that manufacturer is a factor (true) and that year is an integer (also true). displ is a “dbl”, short for double, which sounds a bit weird but is just R’s standard type for numbers that can have decimals. None of the text-based variables showed up as strings, which is good for our purposes with this dataset. (Newer versions of ggplot2 store the text columns as character strings rather than factors, so your output may differ.)

This is fine if you have a relatively small dataset, but it begins to get unwieldy if you have a lot of variables. The summary(mpg) call will give you a different view of your data. For the text-based variables, it gives you a count of them (up to a point) and for the numerical variables, it spits out the minimum, quartiles, mean, and maximum values. Pretty handy for a quick check.
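Here are the three views side by side as a short sketch. It assumes you have both the ggplot2 and dplyr packages installed; exactly what gets printed depends on your package versions.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)    # provides glimpse()

data(mpg)

print(head(mpg))     # column names plus the first six rows
glimpse(mpg)         # one line per variable, along with its type
print(summary(mpg))  # per-variable summaries (depends on each variable's type)
print(dim(mpg))      # number of rows, then number of columns
```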

If you want to see the whole dataset (or at least, a lot more of it, depending on how big it is) in a format more closely resembling that which you’re used to in Excel or something, you can use View(mpg). This will pop up a “normal” looking dataset in the main window for you to peruse.

Alright, now that we have looked at our data, let’s talk about variables. To access a specific variable, you will use the dollar sign. So, if you want to look at (or refer to) the model variable in the dataframe, you will call it by mpg$model. This way R knows that you are looking in the dataframe mpg and you want the variable model. You can use this in combination with lots of other things. For instance, if you wanted to find the minimum year of car that is in the dataset, you could use min(mpg$year) and it should output 1999.
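A few more things you can do once you have pulled out a variable with the dollar sign. This sketch assumes ggplot2 is installed (for the mpg dataset); the variable names on the left are just ones I made up.

```r
library(ggplot2)  # for the mpg dataset

# Pull out single variables with $
models <- mpg$model
oldest_year <- min(mpg$year)

print(head(models))     # the first few model names
print(oldest_year)      # 1999, as mentioned above
print(range(mpg$hwy))   # lowest and highest highway mileage
print(table(mpg$year))  # how many rows there are for each model year
```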

If you make some changes to your dataset (e.g., adding a variable, reshaping it, filtering it, etc.; all topics for a future post), you can also save your dataset in a recognizable format. So if your new dataframe is called mpg2 you can export a csv of that using write.csv(mpg2, file="mpg2.csv"). This will put a new csv file in your working directory with the filename mpg2.csv. (Note that you need to include the .csv extension yourself; R will not add it for you.)
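A quick round trip to show the export working. The little mpg2 dataframe here is a made-up stand-in, and I write to a temporary folder only so the example doesn’t clutter your project; in practice you would probably just use file = "mpg2.csv".

```r
# A small stand-in for a modified dataframe called mpg2 (values are invented)
mpg2 <- data.frame(model = c("a4", "civic"), hwy = c(29, 33))

# Write it out; row.names = FALSE skips the extra column of row numbers
out_path <- file.path(tempdir(), "mpg2.csv")
write.csv(mpg2, file = out_path, row.names = FALSE)

# Read it back in to confirm it looks the way we expect
mpg2_check <- read.csv(out_path)
print(mpg2_check)
```

By default write.csv() also writes the row names as an unnamed first column, which is usually not what you want when you open the file elsewhere.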

Other things to think about with R

In order to maintain an up-to-date version of R within RStudio, there are three separate things you need to update: RStudio itself (the application), R (the base), and all of your packages.

Updating your packages is easy in RStudio. In the Packages tab in the lower right corner (using the default set-up), there is an “Update” button that will show you which packages have updates available and lets you download and install them. Super easy. (Updating RStudio is easy too: look in the Help menu, at least on Macs.)

When you start up RStudio, the console will give you a readout of the current version of R that you are running. As of today, that is version 3.2.2 (“Fire Safety”), but if you have an earlier version of R (as long as it’s not too old) most things should run fine. Updating R is sometimes a pain because you can’t do it directly in RStudio (which I think is confusing to people because you can update your packages easily in RStudio). When you download a new version of R, RStudio will automatically detect that, so that’s not too bad. However, RStudio tries to be helpful and store your downloaded packages in the correct place, but a new version of R (say, going from 3.2 to 3.3) actually creates a new location and you have to migrate all of your packages over to that new place. It’s a bit of a hassle, but there is an easy way around it:

update.packages(checkBuilt = TRUE, ask = FALSE, type = "binary")
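If you want to see what is going on before and after an R upgrade, something like the following can help. This is only a sketch of one possible approach: the file name and location are hypothetical (tempdir() is wiped when R closes, so use a permanent folder in practice), and the reinstall step is left commented out so nothing gets installed by accident.

```r
# BEFORE upgrading R: note which version you have, where packages live,
# and which packages are installed
print(R.version.string)
print(.libPaths())
old_pkgs <- rownames(installed.packages())
pkg_list_file <- file.path(tempdir(), "old_pkgs.rds")  # hypothetical location
saveRDS(old_pkgs, pkg_list_file)

# AFTER upgrading R: see which of the old packages are missing
old_pkgs <- readRDS(pkg_list_file)
missing_pkgs <- setdiff(old_pkgs, rownames(installed.packages()))
print(missing_pkgs)

# Uncomment to actually reinstall them (needs an internet connection)
# install.packages(missing_pkgs)
```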

RStudio also has support for version control. Woo! You can use either git or SVN. I have more experience with SVN, but I am in the midst of switching over to git so maybe I’ll post about that at a later date. I’m not going to go into all of the details for how to set up and use version control, but I will say that it’s a good idea even if you don’t need it for collaboration or sharing purposes.


Next post: we’ll look at how to organize and manipulate your data using my favorite package dplyr!

Visualizing without seeing

Last January was the annual Awesome Games Done Quick marathon, where speed runners show off and explain their skills while raising money for cancer research. One of the final events of the marathon was a blindfolded speed run of the beginning of The Legend of Zelda: Ocarina of Time (OoT) (basically, the first three dungeons). Yes, you read that correctly: a speed runner, Runnerguy2489, was blindfolded and then played OoT.

Intro to R

This is the first in a series of posts on the statistics language R.

Do you work with data (doing data processing, analysis, or visualization)? Are you currently using SPSS, Excel, or SAS and you know it sucks but aren’t sure you want to try something new? Have you heard about R and are scared to try it? Have you tried R but are super confused? Do you currently use R but want to know more? If you answered “yes” to any or all of those questions, then this post (and/or one of the following in the series) is for you! R is a free and open source alternative to SPSS, SAS, and other analysis and statistics programs.

Effect size

There are a lot of different ways to think about comparing two things, or more appropriately perhaps, two sets of things. If they are things we can count, we can easily see which there are more of. If they are more like a score, we can easily see which set has a higher score. We can also fairly easily see what the distribution of the things in each set is, although comparing the distributions is a bit trickier.

Using some basic statistical measures, we can tell whether or not the two sets of things are different from each other using significance testing. This is typically done with a t-test or an analysis of variance (ANOVA) or a similar measure. These types of measures, based on the mean and variance of a set of data points, are simple and easy to calculate (especially with a basic stats program) and have therefore become commonplace in the research literature. But unfortunately, their simplicity ends up hiding a lot of information and potentially interesting nuance.
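As a rough illustration of the difference between “is there a difference?” and “how big is it?”, here is a sketch in R. The two groups of scores are simulated, and Cohen’s d is just one common effect size measure that I picked for the example.

```r
set.seed(42)  # make the made-up data reproducible

# Two simulated groups of scores (invented purely for illustration)
group_a <- rnorm(200, mean = 70, sd = 10)
group_b <- rnorm(200, mean = 72, sd = 10)

# Significance test: are the two group means different?
test_result <- t.test(group_a, group_b)
print(test_result$p.value)

# Cohen's d: difference in means, in units of the pooled standard deviation
n_a <- length(group_a)
n_b <- length(group_b)
pooled_sd <- sqrt(((n_a - 1) * var(group_a) + (n_b - 1) * var(group_b)) /
                    (n_a + n_b - 2))
cohens_d <- (mean(group_b) - mean(group_a)) / pooled_sd
print(cohens_d)
```

The p-value only speaks to whether the difference is likely to be real; the second number says how large the difference is relative to the spread of the scores.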

I am thankful for the public funding of science

Today I am thankful for my friends and family and being born into a middle class family in a first world country and all of that normal Thanksgiving stuff. But there is something else that I am thankful for that I want to call attention to. I am thankful for the public funding of science.

Vector Graph Racer

I love vectors. I mean, they’re pretty awesome. They have a magnitude and a direction. Two for the price of one. They are also quite helpful when doing physics. [Full disclosure: my dissertation had a big focus on vectors, so I’m a little biased.] I also really like graph paper. So you can imagine how excited I was when I stumbled upon a paper-and-pencil game that used vectors as the main mechanic.

The game goes by many names and has been around for a long time. I first saw it as Graph Racer, but I like Vector Racer too. There’s a whole article on the rules and variants on Wikipedia. The game is always different because you draw the board each time you play. So how does it work?