This is an RMarkdown page that is running an analysis of Minnesota death certificate data on people whose deaths involved opioids such as heroin, oxycodone, methadone, fentanyl, etc.
We are going to run through all the basic steps including loading packages, importing a .csv file of data, using the Tidyverse package called “dplyr” to run some analysis (similar to Pivot Tables in Excel) and then use a package called “ggplot2” to create a graphic of some of our findings. Finally we will “knit” this page in order to turn it into an HTML page where we can share our results with others.
If you haven’t already, read this Pros and Cons of R.
You can download the RMarkdown page (called IntroAnalysisWithR.Rmd) used to build this webpage AND the data for this analysis (called “opiate_deaths.csv”) from this github repo
Save the RMarkdown file to a new directory on your computer. Create a sub-directory called “data” and put the opiate_deaths.csv file in that sub-directory. Open RStudio and go to the File menu and choose “New Project.” It will ask you to pick a directory – choose “Existing Directory” and navigate to the one where you stored in the RMarkdown page.
You’ll name your project (whatever you want to name it) and it will create a file with a .RProj extension on it. That’s the file you’ll use to open your project in the future.
You should see that new file, plus your RMarkdown file and a “data” sub-folder in the Files tab on the right lower pane in RStudio. You can click on the data sub-folder to open that folder and confirm that your csv file is in there.
Open the RMarkdown page and you’ll see all this text you were just reading.
In an RMarkdown page you generally run each chunk of code separately. There are a number of ways to run a chunk. Put your cursor in the chunk you want to run. One option is to go to the “Run” button at the top of the page and choose one of the options from the pull-down list. There are also keyboard shortcuts for either launching a single line of code (In Windows: Control+Enter ) or an entire chunk (In Windows: Control+Shift+Enter).
The first time you use R (and if you install an updated version) you will need to “install” packages. But every time you use R for any reason, you then need to “load” any of the installed packages that you want to use in your code.
Below I’ve provided some syntax that could be used in all your projects. It checks to see if the needed packages are installed and will install any that are missing.
Below that there are a series of library() commands to essentially turn on those packages for use in this particular RMarkdown page. You will need that every time you use R. However, the particular libraries/packages you need might differ from one project to another.
#This little piece of code checks to see if you have these packages installed on your computer and it will install them if you don't.
packages <- c("tidyverse", "janitor","ggthemes", "lubridate", "kntir", "htmltools", "rmarkdown", "kableExtra")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())), repos = "http://cran.us.r-project.org")
}
library(readr) #importing csv files; part of Tidyverse
library(dplyr) #general analysis; part of Tidyverse
library(ggplot2) #making charts ; part of Tidyverse
library(ggthemes) #this will help make our charts look better
library(janitor) #for data cleanup
library(lubridate) #for working with date fields
#we'll need these new packages for creating this markdown page to an HTMl page
library(rmarkdown)
library(knitr)
library(htmltools)
library(kableExtra)
The data file we are going to use is stored in a sub-directory to our working directory. The sub-directory is called “data.” Our data file is called “opiate_deaths.csv” and it is a comma-separated values text file.
We are going to use the readr package (which is part of Tidyverse) to import this file. Readr has a function called “read_csv” that is designed to import csv files.
This code below is the most basic we would need to import this data file. However, if we were to use this, readr would guess incorrectly and set our date fields to character. We also have a field showing the age of the person who died that we want to ensure comes in as a numeric field. If we don’t do this, we will have trouble sorting by dates or calculating an average age, for example.
deaths <- read_csv('./data/opiate_deaths.csv')
So we’re going to one of the optional arguments that come with the read_csv function that allows us to set the column types.
In the code below, it’s setting the default to character (“c”), but then it’s expressly telling R to set birthdate, deathdate and injury_date to a date format (col_date). Notice that col_date needs us to tell it how the date is displayed in our csv file. In this case, the dates show up as “5/5/2015”. This website nicely explains the various options if your date fields are stored differently.
At the bottom of our code we are using a pipe operator (%>%) to have it also do one more thing as part of the import. It’s going to run a function called clean_names() from the janitor package. This is a handy little tool that will convert all those uppercase field names to lowercase, ditch extra spaces in column names and things like that. R is case sensitive and doesn’t like spaces or weird symbols in its column names, so this function comes in handy quite often. I tack it on to the end of my imports every time. Notice, however, that when i referred to my column names earlier in the code, I put them the way they are in the csv file because clean_names() hasn’t done it’s work yet.
deaths <- read_csv('./data/opiate_deaths.csv',
col_types=cols(.default="c",
BIRTHDATE=col_date("%m/%d/%Y"),
DEATHDATE=col_date("%m/%d/%Y"),
INJURY_DATE=col_date("%m/%d/%Y"),
AGEYEARS="i")) %>%
clean_names()
The dplyr package is very similar to Structured Query Language (SQL) and it is one way to create the equivalent of Pivot Table where you summarize your data. See my R Cheat Sheet for more on the basic syntax.
Let’s start with simply selecting a few of the columns to display in our output.
Every query you run will start with the name of your dataframe (in this case we called it “deaths”) and then use a pipe operator (%>%) to “link” operations. In this example below, our second operation is to simply display a few of the columns from our data. To do this we use the “select” syntax from dplyr.
deaths %>% select(firstname, lastname, deathdate)
## # A tibble: 3,807 x 3
## firstname lastname deathdate
## <chr> <chr> <date>
## 1 RONNIE BLANKS 2005-01-01
## 2 MICHAEL JOHNSON 2005-01-09
## 3 DONNIE COOPER 2005-01-15
## 4 RANDY SCHLAGEL 2005-01-18
## 5 TERRY LUND 2005-01-20
## 6 ANTHONY ELLIOT 2005-01-21
## 7 GERALD MERCER 2005-01-23
## 8 MARK ALBRECHT 2005-01-25
## 9 BRADLEY GYSLAND 2005-01-26
## 10 DOROTHY DAUM 2005-01-26
## # ... with 3,797 more rows
Now we’ll add one more line of code, using the filter command to limit the number of rows that are returned in our output. In this case, we are asking for all the women who are identified as African American.
deaths %>%
filter(gender=='F' , race=='African American') %>%
select(firstname, lastname, gender, deathdate)
## # A tibble: 105 x 4
## firstname lastname gender deathdate
## <chr> <chr> <chr> <date>
## 1 MARIA FULTZ F 2005-04-30
## 2 CYNTHIA WATSON F 2005-05-10
## 3 KEYDAH BARRY F 2005-05-15
## 4 YOLANDA ONCHWARI F 2005-09-02
## 5 DELOLA REED F 2006-04-19
## 6 NANCY TERRY F 2006-05-11
## 7 VICTORIA GOLDEN F 2006-07-10
## 8 PAMELA SAIN F 2007-05-07
## 9 KIMBERLY COLLIER F 2007-05-21
## 10 MAXINE BATES F 2007-07-31
## # ... with 95 more rows
Who died in January 2015? This tells it to filter for deaths after or equal to 1/1/2015 and before 2/1/2015. Note the format of the dates (yyy-mm-dd) because that’s how it’s stored in the data frame.
deaths %>%
filter(deathdate>='2015-01-01', deathdate<'2015-02-01') %>%
select(lastname, firstname, gender, deathdate)
## # A tibble: 29 x 4
## lastname firstname gender deathdate
## <chr> <chr> <chr> <date>
## 1 ORCUTT MARY F 2015-01-02
## 2 KEEN TOM M 2015-01-02
## 3 EVANS JOHN M 2015-01-03
## 4 MARJAMA GRACE F 2015-01-07
## 5 JOHNSON TERRY M 2015-01-08
## 6 KLINKHAMMER KRISTY F 2015-01-08
## 7 VICK MICHAEL M 2015-01-10
## 8 NEIBER SHERYL F 2015-01-10
## 9 DROBNICK JENNIFER F 2015-01-11
## 10 CHADDOCK KATHERINE F 2015-01-15
## # ... with 19 more rows
This tells it to show people who were Chinese or Japanese. Note the pipe character (|) is used to indicate “or”. In the previous examples the comma was used to indicate “and”.
deaths %>%
filter(race=='Chinese' | race=='Japanese') %>%
select(firstname, lastname, ageyears, race)
## # A tibble: 6 x 4
## firstname lastname ageyears race
## <chr> <chr> <int> <chr>
## 1 AKIKO ROBIDEAUX 34 Japanese
## 2 CHRISTOPHER TAM 24 Chinese
## 3 RYAN BISSONETTE 20 Japanese
## 4 NICHOLAS MCCLELLAN 23 Japanese
## 5 CHARLES LIN 83 Chinese
## 6 ALEXANDER WLODKOWSKI 16 Chinese
There are a lot of other things to learn about filtering data, but we’ll save that for later.
You probably noticed that in the last couple batches of code, each piece is on its own line. I’ve indented them on purpose to make it easier to read. The key is that the pipe (%>%) needs to be at the end of a line (never at the beginning). Once you hit enter after the pipe, RStudio will automatically know you want to indent.
deaths %>%
filter(race=='Chinese' | race=='Japanese') %>%
select(firstname, lastname, ageyears, race)
Up until now, we’ve been picking out pieces from our dataframe – a few rows, a few columns. But usually we want to be able to summarize our data to find totals, percentages, averages and things like that.
The summarize (or “summarise” will also work) function in dplyr allows us to generate counts, sum up values in a column, calculate averages, etc.
The example below is simply counting the number of records in our data frame. Note the “n()” syntax – this is how you tell R to count records, and that I’m creating a column that I’m calling “numrecords.” The column will only appear in this temporary output. It will not be stored in a table. I’ve added some extra spaces to make the code easier to read.
deaths %>% summarize(numrecords = n() )
## # A tibble: 1 x 1
## numrecords
## <int>
## 1 3807
Most often, though, we want to summarize by “groups” within our data; this is what we usually use Pivot Tables for in Excel. For example, how many people of each gender died from opioid-related deaths?
Note that now I’ve added another piece – group_by() – and am telling it to form the groups based on the values in the “gender” column in our data.
deaths %>%
group_by(gender) %>%
summarise(numdeaths = n())
## # A tibble: 2 x 2
## gender numdeaths
## <chr> <int>
## 1 F 1393
## 2 M 2414
The query above was simply counting the number of records where the gender field listed “M” (male) and how many listed “F” (female). But we can use summarize to do all kinds of calculations.
For example, what was the average age by gender?
We’ll use the mean() function to calculate an average on our “ageyears” column. Note I’m making a new column called “avg_age.”
deaths %>%
group_by(gender) %>%
summarize(avg_age = mean(ageyears))
## # A tibble: 2 x 2
## gender avg_age
## <chr> <dbl>
## 1 F 44.3
## 2 M 39.6
We can also use summarize() to sum the values in a column. Unfortunately, this data only has one numeric field (ageyears) and it doesn’t make sense to sum those values.
But imagine you had a data frame listing each expenditure you made in your household last year and you want to get a total for each month. It would look something like this, assuming your data has a column indicating the month each expense occurred and a column called “amount_spent” identifying how much the item cost:
myhousehold_expenses %>%
group_by(month) %>%
summarize(monthly_total = sum(amount_spent))
Let’s return to our death data.
Sometimes we want to count or sum things based on 2 values. For example, we want to find out how many of the people who died were both women and African American.
deaths %>%
group_by(gender, race) %>%
summarise(numdeaths = n())
## # A tibble: 27 x 3
## # Groups: gender [2]
## gender race numdeaths
## <chr> <chr> <int>
## 1 F African American 105
## 2 F American Indian 131
## 3 F American Indian-multirace 4
## 4 F Asian-multirace 1
## 5 F Black-multirace 6
## 6 F Japanese 1
## 7 F Other Asian 6
## 8 F Other Race 5
## 9 F Unknown 7
## 10 F White 1114
## # ... with 17 more rows
Let’s add one more piece from dplyr – arrange() – so we can see our results in a prescribed order. For example, that last query we ran was very long. It might be nice to see the biggest group at the top. The default behavior for arrange() is ascending order, so in order to get the biggest on top (descending order), we need to add “desc” as part of the code.
deaths %>%
group_by(gender, race) %>%
summarise(numdeaths = n()) %>%
arrange(desc(numdeaths))
## # A tibble: 27 x 3
## # Groups: gender [2]
## gender race numdeaths
## <chr> <chr> <int>
## 1 M White 1982
## 2 F White 1114
## 3 M African American 229
## 4 M American Indian 137
## 5 F American Indian 131
## 6 F African American 105
## 7 M Other Race 19
## 8 F White-multirace 13
## 9 M White-multirace 13
## 10 F Unknown 7
## # ... with 17 more rows
There are times you want to limit the results in your output in some way or another. You can do that by adding a filter command at the end of your query. In this example, we’re telling it to only give us results for groups where there were more than 100 deaths.
deaths %>%
group_by(gender, race) %>%
summarise(numdeaths = n()) %>%
arrange(desc(numdeaths)) %>%
filter(numdeaths>100)
## # A tibble: 6 x 3
## # Groups: gender [2]
## gender race numdeaths
## <chr> <chr> <int>
## 1 M White 1982
## 2 F White 1114
## 3 M African American 229
## 4 M American Indian 137
## 5 F American Indian 131
## 6 F African American 105
This is one of my favorite features of dplyr. Mutate() allows you to add a new column, either to a temporary output or add it to a data frame. We’ll start by adding a new column to our output. Remember the query about how many were men and how many were women? Wouldn’t it be great to show percentages?
Notice that mutate() comes after summarize(). This is important because mutate is going to take the column we created in summarize() – numdeaths – and generate a percentage from it. Because we’re grouping by gender, this is taking the numdeaths for the gender and dividing it by the total of the numdeaths column. And we’re creating a new column called “pct”.
deaths %>%
group_by(gender) %>%
summarize(numdeaths = n()) %>%
mutate(pct = numdeaths/sum(numdeaths)) %>%
arrange(desc(pct))
## # A tibble: 2 x 3
## gender numdeaths pct
## <chr> <int> <dbl>
## 1 M 2414 0.634
## 2 F 1393 0.366
Let’s put a bunch of things together in one query. We’ll filter to only the women, then group them by race to find out how many women of each race died, what the average age was in each group and the percentage of the total women that each group accounts for.
We’ll add a couple more things to fancy this up. First, you’ll see that we’re using the round function (from Base R) to round our percentage to one decimal point. We’re also multiplying it by 100 to make it a whole number. Note: the location of the parentheses on that calculation is really important and easy to get wrong.
deaths %>%
filter(gender=='F') %>%
group_by(race) %>%
summarise(numdeaths = n(), avgage = round(mean(ageyears),1)) %>%
mutate(pct = round((numdeaths/sum(numdeaths)*100),1)) %>%
arrange(desc(pct))
## # A tibble: 11 x 4
## race numdeaths avgage pct
## <chr> <int> <dbl> <dbl>
## 1 White 1114 45.4 80
## 2 American Indian 131 38.1 9.4
## 3 African American 105 42.9 7.5
## 4 White-multirace 13 36.5 0.9
## 5 Unknown 7 35.3 0.5
## 6 Black-multirace 6 28.3 0.4
## 7 Other Asian 6 55 0.4
## 8 Other Race 5 37.6 0.4
## 9 American Indian-multirace 4 39.5 0.3
## 10 Asian-multirace 1 40 0.1
## 11 Japanese 1 34 0.1
One of the questions we might have for this data is a simple one: How many people died each year? Our data, however, has the date of death, but not a separate field identifying the year. So let’s create one. We’ll use a package called lubridate that has a year() function, which will strip the year out of the date. Note: The date field needs to be stored as a true date field (not character). Remember how we fixed that on the import? This is why.
#this is going to overwrite our existing data frame, pull everything from that existing data frame and then use mutate() -- from dplyr -- to add a new column called "deathyr"
deaths <- deaths %>% mutate(deathyr = year(deathdate) )
Let’s look at what that shows us:
deaths %>%
group_by(deathyr) %>%
summarize(numdeaths = n())
## # A tibble: 13 x 2
## deathyr numdeaths
## <dbl> <int>
## 1 2005 168
## 2 2006 187
## 3 2007 192
## 4 2008 237
## 5 2009 277
## 6 2010 237
## 7 2011 301
## 8 2012 310
## 9 2013 325
## 10 2014 335
## 11 2015 344
## 12 2016 401
## 13 2017 493
Let’s start by looking at the hispanicethnicity field
# let's look at this field to see where we've got a problem
deaths %>% group_by(hispanicethnicity) %>% summarise(count=n())
## # A tibble: 8 x 2
## hispanicethnicity count
## <chr> <int>
## 1 hispanic 1
## 2 HISPANIC 75
## 3 non-hispanic 1
## 4 NON-HISPANIC 3
## 5 NOT-HISPANIC 2
## 6 not hispanic 1
## 7 NOT HISPANIC 3718
## 8 UNKNOWN 6
Notice the wide variety of ways that Hispanic and Non-Hispanic are listed? In order to get a good count, we need to make them consistent. This is called standardizing data.
We are going to create a new column – called hispanic_new – and use a function called case_when() to fix the problematic values.
#case_when is from dplyr: https://dplyr.tidyverse.org/reference/case_when.html
deaths <- deaths %>%
mutate(hispanic_new =
case_when(hispanicethnicity=='Non-Hispanic' |
hispanicethnicity=='NOT HISPANIC' | hispanicethnicity=='NOT-HISPANIC' |
hispanicethnicity=='non-hispanic' | hispanicethnicity=='NON-HISPANIC'~'NOT HISPANIC',
TRUE~toupper(hispanicethnicity)))
# the base R function called "toupper()" converts values to uppercase
#let's see our new column
deaths %>%
group_by(hispanic_new) %>%
summarise(numdeaths = n())
## # A tibble: 3 x 2
## hispanic_new numdeaths
## <chr> <int>
## 1 HISPANIC 76
## 2 NOT HISPANIC 3725
## 3 UNKNOWN 6
If you want to export some or all of your data, you can use write.csv
It will spit out a file to your working directory
#Let's spit out a subset of data -- in this case, the number of deaths by year
deaths_by_year <- deaths %>%
group_by(deathyr) %>%
summarize(numdeaths = n())
#this will send a csv to your working directory
#if you don't include row.names it will add a column, numbering each row 1 through x.
write.csv(deaths_by_year, 'deaths_by_year.csv', row.names=FALSE)
#Another way to export a sub-set of your data is to include a filter in your write.csv code.
#In this example, I'm going to export all the columns from the main table, but only for the women who died
write.csv(deaths %>% filter(gender=='F'), 'women_deaths.csv', row.names=FALSE)
#You can send the file to a sub-directory like this (this assumes you have a sub-directory called "output")
#write.csv(deaths %>% filter(gender=='F', './output/women_deaths.csv', row.names=FALSE))
About knitr
Push the Knit button at the top of the page and it will automatically create an HTML file, as long as there are no errors in your code.
The code at the top of the page that says “echo=TRUE” means that the HTML page will include your code chunks and the results. Try running this page one time with echo=TRUE, open the HTML page (in your working directory) and see what it looks like. Then chase echo=FALSE and knit the page again and see how it has changed.
Also notice that the results of our queries don’t look very pretty. It is possible to make prettier tables using kable() and kableExtra(), which come with the kntir package.