I'm just getting started learning R, so I'm going to start collecting useful stuff here. It will be a work in progress. I welcome any suggestions you have to either add to this site or offer up better approaches than what I'm offering here.
Tutorial using tidyverse
I created this tutorial to walk you through key elements of the tidyverse package for analysis and data visualization. I find that the dplyr package in the tidyverse is quite similar to SQL, which has made it easier for me to understand.
Useful beginner training materials
- Microsoft edX: Introduction to R for Data Science. I found this free course useful for the very basics of R, especially in getting a handle on terminology and syntax.
- Peter Aldhous' Data Analysis with R training materials. This site really opened the door for me. It focuses on the SQL-like capabilities of the dplyr package and using ggplot2 for visualizations, and uses real journalism examples.
- David Montgomery's R for Beginners. This was a good second tutorial to hit. It also focuses on the SQL-like capabilities and visualizations with real journalism examples, but goes a little deeper than Aldhous' tutorial.
- Beginner's guide to R, by Sharon Machlis. This covers a lot of basic ground and provides a ton of useful links.
- E-book: R for Data Science by Garrett Grolemund and Hadley Wickham.
- Excel vs R: A Brief Introduction to R
Useful R packages & how-to's
Most of the great capabilities that you can unleash with R come via packages (like libraries) that you need to install separately. There are thousands of packages! Here are the ones I've found to be must-haves.
- Tidyverse: a collection of packages. This gives you the SQL-like capabilities and tools for making visualizations.
- ggplot2 gallery of extensions. ggplot2 comes with tidyverse, but here you can find links to a ton of extensions, including gganimate for animating your graphics.
- Aggregating and analyzing data with dplyr. A useful guide to using the dplyr package (which comes with tidyverse).
- How to recode data in R
- How to plot animated maps with gganimate. This code can be applied to other graphic types, as well.
- The R Graph Gallery. Includes examples of code for how to make a variety of graphics with ggplot2.
- Cookbook for R
- This new package looks promising: census package to scrape US Census data. I haven't tried it out yet, though.
- This new package looks fun: billboard package including scraped datasets related to the Billboard Top 100 going back to 1960. Would be a fun dataset to use for training.
- Lubridate package for working with dates.
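As a quick taste of what lubridate does, here's a small sketch (the dates themselves are made up for illustration):

```r
library(lubridate)

# Parse dates written in different orders
d1 <- mdy("01/15/2018")        # month/day/year
d2 <- ymd("2018-03-20")        # year-month-day

# Pull out pieces of a date
year(d1)                       # 2018
wday(d1, label = TRUE)         # day of the week

# Date arithmetic
d2 - d1                        # difference in days
d1 + months(2)                 # add two calendar months
```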
Working with IPUMS data in R
The Minnesota Population Center recently released an R package called ipumsr for pulling in various IPUMS datasets, including CPS, NHGIS and IPUMS-USA. The documentation is very clear. It's important to spend some time reading about how value labels work in this vignette.
RStudio gives you several options for working with R. I particularly like using R Notebooks. A notebook allows you to have your code and the results all inline with each other. It also spits out an html file with all of that. Since most of my work involves pairing up with other reporters and editors, I can share the html file with them so they can see the various tables and charts I've produced along the way. The html file allows the user to "hide all code," leaving behind just the tables and charts. Here's a notebook I made using Bureau of Labor Statistics data.
My favorite package so far is dplyr, which comes with Tidyverse. It works so much like SQL -- which I've known for a long time -- so it's been very easy to pick up. Interestingly, it seems like dplyr is more flexible and lightweight than SQL. It's possible to run a query that filters and summarizes and arranges your data, then add a new column calculating a rank or a percentage all with just five or six lines of simple code. Doing that in SQL is mind-numbing. I picked up most of dplyr through some of the tutorials linked above, but I also laid out some cash for this DataCamp class, which I found to be really helpful.
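Here's a rough sketch of that kind of query. The salaries table and its column names are made up for illustration, so I build a toy version inline:

```r
library(dplyr)

# Toy data standing in for a real salaries table
salaries <- data.frame(
  agency = c("Fire", "Fire", "Police", "Police", "Parks"),
  salary = c(60000, 55000, 70000, 65000, 40000)
)

result <- salaries %>%
  filter(salary > 50000) %>%            # like a WHERE clause
  group_by(agency) %>%                  # like GROUP BY
  summarise(avg_salary = mean(salary),
            employees  = n()) %>%
  arrange(desc(avg_salary)) %>%         # like ORDER BY ... DESC
  mutate(rank = row_number(),           # the extra columns SQL makes painful
         pct_of_top = round(avg_salary / max(avg_salary) * 100, 1))
```

Each `%>%` pipe hands the result of one step to the next, which is what lets you stack a filter, a summary, a sort and new calculated columns into one readable block.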
Pull data from a server
Since I store the bulk of my data on a MySQL server (on Amazon), it's really great to be able to connect R to the server. There are a few ways to do this, but it seems the route most people use is the RMySQL package. This one caused me a few headaches because I had a hard time figuring out a good way to connect without storing my username and password in the code. Supposedly you can use a "my.cnf" file, but I couldn't get that to work, even after extensive googling on the problem. Ultimately, I settled on using an RStudio option that prompts you for the username, password and host connection.
Here's what that code looks like. Note, I'm not displaying dbname here (name of the database within the server):
library(RMySQL)

myconnection <- dbConnect(RMySQL::MySQL(),
                          dbname = "...",
                          host = rstudioapi::askForPassword("Database host"),
                          user = rstudioapi::askForPassword("Database username"),
                          password = rstudioapi::askForPassword("Database password"))
Then you can have it load a list of the tables on the server using:
dbListTables(myconnection)
Or have it list the fields for a particular table on the server:
dbListFields(myconnection, "tablename")
You can run a query on the server like this:
mydata <- dbSendQuery(myconnection, "select * from tablename where gender='F'")
Then you can turn that into a dataframe, if you want:
mydf <- fetch(mydata, n=-1)
The "-1" in that last piece of code tells it to return all the records. You could change that if, for example, you only want the first 100 records (n=100).
Once you have the data in a data frame, then you can use dplyr or whatever other code to analyze your data frame.
Appending files together
A common problem in data journalism is that the government agency gives you a separate data file for each year of data you requested. Or you get a data file one year. Then the next year you get an updated batch that you need to append to the first. As long as each file contains all the same columns (and column labels), there are a couple different ways (at least!) to append multiple files together.
If you want to append together all the csv files from within a given directory, Andrew Ba Tran has made this lovely package called "bulk_csv". He also has one for putting together Excel files. All of his packages are here.
If you're just putting together a couple of files, you can use R's rbind function. First you need to load the csvs into data frames, then you create a new data frame (which I've named "merged" below) to put them together.
library(readr)

table1 <- read_csv("table1.csv")
table2 <- read_csv("table2.csv")
merged <- rbind(table1, table2)
Interesting side note: it's also possible to filter the records from each of the csvs and only merge together certain rows. I did that with some Bureau of Labor Statistics data, pulling out only records for food establishments. (I used rbind for this since it was before I discovered bulk_csv.)
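The general pattern looks roughly like this. To keep the sketch self-contained, it first writes two toy csv files standing in for the agency's yearly releases; the file names, columns and NAICS code are all made up for illustration:

```r
library(readr)
library(dplyr)

# Toy csv files standing in for the agency's yearly releases
write_csv(data.frame(industry_code = c(722, 541), jobs = c(100, 50)), "bls_2017.csv")
write_csv(data.frame(industry_code = c(722, 236), jobs = c(120, 80)), "bls_2018.csv")

files <- c("bls_2017.csv", "bls_2018.csv")

# Read each file, keep only the food-establishment rows, then stack them
pieces <- lapply(files, function(f) {
  read_csv(f) %>%
    filter(industry_code == 722)   # NAICS 722 = food services (illustrative)
})
food <- do.call(rbind, pieces)
```

The lapply/filter step is what keeps only the rows you want from each file before rbind ever sees them, which matters when the raw files are too big to stack whole.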
Joining data frames
Another common issue for us data journalists is having to join two datasets together on one or more common fields/columns. Just like in SQL, you can do an inner_join (return only matching records) or a left_join (return all records from the first table listed and only those that match from the second) using the dplyr package. Here's the official documentation on joining. In the code below, replace the two instances of "fieldname" with the correct names of the matching columns in your two tables, and replace "table1" and "table2" with the names of the two data frames you are joining. In my example, "newtable" is the name of the new data frame created by the join.
newtable <- inner_join(table1, table2, by=c("fieldname"="fieldname"))
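A left_join works the same way. Here's a sketch with two toy data frames (the population figures are made up) so you can see how rows with no match in the second table come back with NA:

```r
library(dplyr)

# Toy tables: counties with (made-up) populations, and a name lookup
# that is missing the third fips code
counties  <- data.frame(fips = c("27053", "27123", "27001"),
                        pop  = c(1265000, 550000, 15000))
names_tbl <- data.frame(fips   = c("27053", "27123"),
                        county = c("Hennepin", "Ramsey"))

# All three counties survive the join; the unmatched fips gets NA
newtable <- left_join(counties, names_tbl, by = c("fips" = "fips"))
```

With inner_join instead, the unmatched row would be dropped entirely, which is the main difference to keep in mind when you pick one.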