Data Cleaning and Manipulation/Organization

This is the third part in an ongoing series on how and why you should be using R. If you missed the earlier ones, you can check out part 1 (Intro to R) and part 2 (R Basics). This post will go into some more specifics relating to data cleaning, organization, and manipulation.

In my opinion, the dplyr package is a game changer for those trying to learn R. It is what moved me from just recommending that people use R to basically demanding that my friends and co-workers switch to R. I remember the day that I finally got around to learning the package’s functionality and all of the ways in which it lets you easily and clearly manipulate your data frames. I just kind of stared at my computer screen and imagined how much better my data-life was going to be with these tools. I realized that the hours and hours I used to spend in Excel trying to massage my data into the right form were over. Also, I wouldn’t have to decipher weird base R code anymore when trying to create new variables or filter datasets. The dplyr package and its friends make your code and scripts much easier to read, which will help both you and future you figure out what is going on.


Multi-modal Learning Data Collection at (Small) Scale

subtitle: even the best-laid plans…

Last year (spring 2015) we collected a really nice set of data from students collaborating in groups of three. The data collection process wasn’t entirely smooth or perfect, but it generally went off without any major technical or logistical problems. We ended up with a dataset of almost 150 students with high-quality audio data (four channels per group), video recordings (one per group), and computer log files (ideally one per group; in practice, more than one). [NB: The annotated audio from this first phase of data collection will be made available soon to other researchers. You can read the paper about the data set (presented at Interspeech 2016) here.]

In the spring of 2016 we set off to do our second phase of data collection, in classrooms during a regular class period. So unlike the first phase where we had just two groups at a time with kids who had volunteered and were excited to try out some math problems (a.k.a. the best kids), we had up to 10 groups at once with varying levels of excitedness and/or willingness to follow directions. We mostly wanted to test out how well the audio capture worked with all of the background noise in a typical classroom environment and see if our speech models still held up.


exporting Mendeley tags to csv or xls

I recently was struggling with a data problem, one that I hadn’t anticipated being so … annoying. I eventually found a solution, and thought I would share it so others wouldn’t have to go through the same thing that I did. [Update: I have a link to the files that I made at the bottom of the post.]

One of the projects that I am working on is a huge meta-analysis of research studies surrounding simulations used for learning STEM content, inquiry, and related skills. I am leading a team of researchers in this task (some of whom are at a different institution), and initially tried to figure out the best way to collaborate in reading and coding thousands of article abstracts. I decided to go with Mendeley, mostly because I knew it would be easy to get the citations into it, it’s free, it works really well, and I could set up collaborative groups to facilitate our coding process. What I hadn’t anticipated was how difficult it would be to get the citations out of Mendeley and into a spreadsheet (which I was using for analysis of the abstract data and the next steps of our overall project). Specifically, I needed to get the tags out of Mendeley, since we were using the tag field for our coding.

[Sidenote: Using tags to code the abstracts in Mendeley actually worked pretty well in my opinion.]

Mendeley, while awesome in many respects, does not have many options for exporting data. You can do BibTeX (.bib), RIS (.ris), or EndNote XML (.xml). Through some sleuthing, I managed to figure out that BibTeX was probably my best bet, as the tags were definitely part of that export (they didn’t seem to be in the xml file and I didn’t really want to deal with RIS if I didn’t have to). Now, of course, the new problem is that BibTeX isn’t a spreadsheet, and it’s really only meant to be read by other citation management software (and there’s a good reason why I don’t use EndNote anymore). So I still needed to get it into a csv type format. Luckily, through some interwebz searching, I found the program JabRef, which is an open source citation manager and is able to export to csv. Hooray!

However, my problems still weren’t quite over, as the default csv export filter didn’t include the tags that I wanted. Although free, open source programs are great in a lot of ways, they are usually not so great in the documentation of certain operations. And this was one of those situations.

So here’s how to make a custom csv export filter for JabRef that includes the tags that Mendeley uses. [Update: you can skip this step and just download the files I made – links at the bottom of the post.]

First, download the src files. These include the standard export filter layouts, which you can then edit and tweak to your heart’s content. They can be found here: http://sourceforge.net/projects/jabref/files/jabref/

Open up the file openoffice-csv.layout (in your text editor of choice) and decide which fields you want to be included in your export to csv. “\mendeley-tags” is the correct way to refer to the Mendeley tags in the BibTeX file. My suggestion is to definitely delete the “\annote” field (especially if, like me, you had a bunch of random crap in the Note field in Mendeley, some of which took up a lot of space). You can of course change the order of these and delete other fields as well. In hindsight, having gone through the rest of this process, I would have taken out more fields because of the regular expressions weirdness that came out the other side when there were strange symbols or characters (which happened a lot in the author and abstract fields – more on that later). If you don’t need the address, I would suggest deleting that too, since there were a bunch of formatting issues with it in the csv version.

You can also edit the openoffice-csv.begin.layout file to change the headings for the csv export – and obviously, it’s easier later on if you have ensured that the data and the headings match up. :) Make sure to rename your edited files (I called mine custom-csv.layout and custom-csv.begin.layout) and then save them somewhere convenient.

Then go back to JabRef, select Menu Options – Manage custom exports – Add new. Fill in the fields: export name – call it whatever you want; main layout file: navigate to your saved .layout file; file extension: .csv. Click Ok and Close. Then go to Menu – File – Export and your newly defined export filter should be listed there.
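Once the export has run, it’s worth a quick check that the headings and the data really do match up. Here is a minimal Python sketch of that check, not something I used at the time. It assumes the export is an ordinary comma-separated file with a single header row; the function name and the sample rows are made up for illustration.

```python
import csv
import io


def find_misaligned_rows(csv_text):
    """Return the header and a list of (record_number, field_count)
    for every record whose width differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    # Record 1 is the header, so data records start at 2.
    for record_number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore completely blank lines
        if len(row) != len(header):
            problems.append((record_number, len(row)))
    return header, problems


# Hypothetical sample standing in for an exported file.
sample = (
    '"Author","Title","Year","Tags"\n'
    '"Smith, J.","Simulations in physics","2010","include,physics"\n'
    '"Lee, K.","Inquiry learning","2011"\n'
)

header, problems = find_misaligned_rows(sample)
print(len(header), "columns in header; misaligned records:", problems)
```

For a real file you would read the text with `open(path, newline="", encoding="utf-8").read()` (the encoding is a guess; yours may differ) and pass it to the same function.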

Now, this solution is not perfect by any means. There were definitely lots of formatting and regexp hiccups in the transformation and export processes. I’m not sure where these occurred and I’m not skilled enough at regexp to fix them. So, I had to go through my csv file and correct a lot of the issues (the concatenate function in Excel was exceedingly helpful in this regard). But, in the end I got all 2392 of my citations into an xls format and can now manipulate my data with ease.
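If all you need is the tags, a short script can skip the export filter entirely and read them straight out of the .bib file. The following is only a rough Python sketch of that idea, not the route I took. It assumes every entry begins with `@type{key,` at the start of a line, that the tags sit in a `mendeley-tags = {...}` field with no nested braces, and that the tags are separated by commas. The function names and the sample entries are invented for illustration.

```python
import csv
import io
import re

# Assumed shape of an entry header: "@article{SomeKey," at the start of a line.
ENTRY_START = re.compile(r"^@\w+\{([^,]+),", re.MULTILINE)
# Assumed shape of the tag field: mendeley-tags = {tag one,tag two}
TAGS_FIELD = re.compile(r"mendeley-tags\s*=\s*\{([^}]*)\}")


def extract_tags(bibtex_text):
    """Return a list of (citation_key, [tags]) pairs, one per entry."""
    rows = []
    starts = list(ENTRY_START.finditer(bibtex_text))
    for i, match in enumerate(starts):
        # An entry's body runs until the next entry header (or end of text).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bibtex_text)
        body = bibtex_text[match.end():end]
        found = TAGS_FIELD.search(body)
        tags = []
        if found:
            tags = [t.strip() for t in found.group(1).split(",") if t.strip()]
        rows.append((match.group(1).strip(), tags))
    return rows


def tags_to_csv(rows):
    """Write the pairs as csv text, joining each entry's tags with ';'."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["citation_key", "tags"])
    for key, tags in rows:
        writer.writerow([key, ";".join(tags)])
    return out.getvalue()


# Hypothetical two-entry export used only to show the idea.
sample_bib = """
@article{Smith2010,
  title = {Simulations in physics},
  mendeley-tags = {include,physics},
  year = {2010}
}
@article{Lee2011,
  title = {Inquiry learning},
  year = {2011}
}
"""

rows = extract_tags(sample_bib)
print(tags_to_csv(rows))
```

Joining the tags with a semicolon keeps them in a single spreadsheet cell; entries with no tag field just get an empty cell. Titles, authors, and abstracts are deliberately left out here, since those are the fields where the strange characters caused trouble.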

[Update: Here are the files that I created. This is the .layout file and this is the begin.layout file.]