Data Journalism:
Getting started with R

By MaryJo Webster


I'm just getting started learning R, so I'm collecting useful material here. It's a work in progress. I welcome any suggestions, whether it's something to add to this site or a better approach than the one I've offered.

To get started, you need to download R. I also highly recommend using RStudio as an interface, so you'll need to download that as well.

Tutorial using tidyverse

I created this tutorial to walk you through key elements of the tidyverse package for analysis and data visualization. I find that the dplyr package within the tidyverse is very similar to SQL, which has made it easier for me to understand.

Useful beginner training materials

Useful R packages & how-to's

Most of the great capabilities you can unleash with R come via packages (like libraries) that you need to install separately. There are thousands of packages! Here are the ones I've found to be must-haves.

Working with IPUMS data in R

The Minnesota Population Center recently released an R package called ipumsr for pulling in various IPUMS datasets, including CPS, NHGIS and IPUMS-USA. The documentation is very clear. It's important to spend some time reading about how value labels work in this vignette.

RStudio

RStudio gives you several options for working with R. I particularly like using R Notebooks. A notebook lets you keep your code and the results inline with each other, and it also spits out an HTML file containing all of that. Since most of my work involves pairing up with other reporters and editors, I can share the HTML file with them so they can see the various tables and charts I've produced along the way. The HTML file lets the reader "hide all code," leaving behind just the tables and charts. Here's a notebook I made using Bureau of Labor Statistics data.

dplyr

My favorite package so far is dplyr, which comes with the tidyverse. It works so much like SQL -- which I've known for a long time -- that it's been very easy to pick up. Interestingly, dplyr seems more flexible and lightweight than SQL. It's possible to run a query that filters, summarizes and arranges your data, then adds a new column calculating a rank or a percentage, all in just five or six lines of simple code. Doing that in SQL is mind-numbing. I picked up most of dplyr through some of the tutorials linked above, but I also laid out some cash for this DataCamp class, which I found really helpful.
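Here's a sketch of the kind of query I'm describing, using a made-up salary data frame (the data and column names are invented for illustration):

```r
library(dplyr)

# Made-up example data
salaries <- data.frame(
  dept = c("Police", "Police", "Fire", "Fire", "Parks"),
  pay  = c(60000, 80000, 70000, 90000, 50000)
)

dept_pay <- salaries %>%
  filter(pay > 55000) %>%              # like SQL WHERE
  group_by(dept) %>%                   # like SQL GROUP BY
  summarise(avg_pay = mean(pay)) %>%   # like SQL AVG()
  arrange(desc(avg_pay)) %>%           # like SQL ORDER BY
  mutate(rank = row_number())          # add a rank column in the same pipeline
```

That's a filter, a group-and-summarize, a sort and a new ranking column, all in one readable pipeline.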

Pull data from a server

Since I store the bulk of my data on a MySQL server (on Amazon), it's really great to be able to connect R to the server. There are a few ways to do this, but the route most people seem to use is the RMySQL package. This one caused me a few headaches because I had a hard time figuring out a good way to connect without storing my username and password in the code. Supposedly you can use a "my.cnf" file, but I couldn't get that to work, even after extensive googling on the problem. Ultimately, I settled on using an RStudio option that prompts you for the username, the password and the host connection.

Here's what that code looks like. Note, I'm not displaying dbname here (name of the database within the server):

install.packages("RMySQL")
library(RMySQL)
myconnection <- dbConnect(RMySQL::MySQL(),
                host = rstudioapi::askForPassword("host"),
                dbname = "******",
                user = rstudioapi::askForPassword("Database user"),
                password = rstudioapi::askForPassword("Database password"))

Then you can have it load a list of the tables on the server using:
dbListTables(myconnection)

Or have it list the fields for a particular table on the server:
dbListFields(myconnection, 'tablename')

You can run a query on the server like this:
mydata <- dbSendQuery(myconnection, "select * from tablename where gender='F'")

Then you can turn that into a dataframe, if you want:
mydf <- fetch(mydata, n=-1)

The "-1" in that last piece of code tells it to return all the records. You could change that if, for example, you only want the first 100 records.
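For example, using the same result set from above, grabbing just the first 100 rows and then cleaning up would look something like this:

```r
# Fetch only the first 100 records instead of everything
first100 <- fetch(mydata, n = 100)

# When you're done, clear the result set and close the connection
dbClearResult(mydata)
dbDisconnect(myconnection)
```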

Once you have the data in a data frame, then you can use dplyr or whatever other code to analyze your data frame.

Appending files together

A common problem in data journalism is that the government agency gives you a separate data file for each year of data you requested. Or you get a data file one year, and the next year you get an updated batch that you need to append to the first. As long as each file contains the same columns (and column labels), there are a couple of different ways (at least!) to append multiple files together.

If you want to append all the csv files within a given directory, Andrew Ba Tran has made this lovely package called "bulk_csv". He also has one for putting together Excel files. All of his packages are here.
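If you'd rather not use a package, a rough sketch of the same idea in base R looks like this (the "data" directory name is made up, and all the files need to share the same columns):

```r
library(readr)

# Find every csv in the directory, read each one, then stack them
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
all_years <- do.call(rbind, lapply(files, read_csv))
```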

If you're just putting together a couple of files, you can use R's rbind function. First load the csv's into data frames, then create a new data frame (which I've named "merged" below) that puts them together.

library(readr)

table1 <- read_csv("table1.csv")
table2 <- read_csv("table2.csv")
merged <- rbind(table1, table2)

An interesting side note: it's also possible to filter the records from each of the csv's and only append certain rows. Here's the code I used to do that on some Bureau of Labor Statistics data, pulling out only records for food establishments. (I used rbind for this because it was before I discovered bulk_csv.)
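A simplified, made-up version of that pattern looks like this (the real thing read the BLS csv files with read_csv first; here the data frames and column names are invented):

```r
library(dplyr)

# Stand-ins for two years of data files
y2016 <- data.frame(industry = c("Food establishments", "Retail"), jobs = c(100, 200))
y2017 <- data.frame(industry = c("Food establishments", "Retail"), jobs = c(110, 190))

# Filter each year down to the rows you want, then append
food <- rbind(
  filter(y2016, industry == "Food establishments"),
  filter(y2017, industry == "Food establishments")
)
```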

Joining data frames

Another common issue for us data journalists is having to join two datasets on one or more common fields/columns. Just like in SQL, you can do an inner_join (return only matching records) or a left_join (return all records from the first table listed and only those that match from the second table) using the dplyr package. Here's the official documentation on joining. In the code below, replace the two instances of "fieldname" with the names of the matching columns in your two tables, and replace "table1" and "table2" with the names of the two data frames you are joining. In my example, "newtable" is the name of the new data frame created by the join.

newtable <- inner_join(table1, table2, by = c("fieldname" = "fieldname"))
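A left_join works the same way, even when the matching column has different names in the two tables. Here's a small made-up example (the tables and column names are invented):

```r
library(dplyr)

# Made-up example tables
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
pay    <- data.frame(person_id = c(1, 2), salary = c(50000, 60000))

# Keep every row from people; salary comes back NA where there's no match
newtable <- left_join(people, pay, by = c("id" = "person_id"))
```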