About

This is an RMarkdown page that is running an analysis of Minnesota death certificate data on people whose deaths involved opioids such as heroin, oxycodone, methadone, fentanyl, etc.

We are going to run through all the basic steps including loading packages, importing a .csv file of data, using the Tidyverse package called “dplyr” to run some analysis (similar to Pivot Tables in Excel) and then use a package called “ggplot2” to create a graphic of some of our findings. Finally we will “knit” this page in order to turn it into an HTML page where we can share our results with others.

If you haven’t already, read this Pros and Cons of R.

How to run this code on your own

You can download the RMarkdown page (called IntroAnalysisWithR.Rmd) used to build this webpage AND the data for this analysis (called “opiate_deaths.csv”) from this github repo

Save the RMarkdown file to a new directory on your computer. Create a sub-directory called “data” and put the opiate_deaths.csv file in that sub-directory. Open RStudio and go to the File menu and choose “New Project.” It will ask you to pick a directory – choose “Existing Directory” and navigate to the one where you stored in the RMarkdown page.

You’ll name your project (whatever you want to name it) and it will create a file with a .RProj extension on it. That’s the file you’ll use to open your project in the future.

You should see that new file, plus your RMarkdown file and a “data” sub-folder in the Files tab on the right lower pane in RStudio. You can click on the data sub-folder to open that folder and confirm that your csv file is in there.

Open the RMarkdown page and you’ll see all this text you were just reading.

How to “run” a chunk of code

In an RMarkdown page you generally run each chunk of code separately. There are a number of ways to run a chunk. Put your cursor in the chunk you want to run. One option is to go to the “Run” button at the top of the page and choose one of the options from the pull-down list. There are also keyboard shortcuts for either launching a single line of code (In Windows: Control+Enter ) or an entire chunk (In Windows: Control+Shift+Enter).

Load packages

The first time you use R (and if you install an updated version) you will need to “install” packages. But every time you use R for any reason, you then need to “load” any of the installed packages that you want to use in your code.

Below I’ve provided some syntax that could be used in all your projects. It checks to see if the needed packages are installed and will install any that are missing.

Below that there are a series of library() commands to essentially turn on those packages for use in this particular RMarkdown page. You will need that every time you use R. However, the particular libraries/packages you need might differ from one project to another.

#This little piece of code checks to see if you have these packages installed on your computer and it will install them if you don't.
packages <- c("tidyverse", "janitor","ggthemes", "lubridate", "kntir", "htmltools", "rmarkdown", "kableExtra")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())), repos = "http://cran.us.r-project.org")  
}

library(readr) #importing csv files; part of Tidyverse
library(dplyr) #general analysis; part of Tidyverse
library(ggplot2) #making charts ; part of Tidyverse

library(ggthemes) #this will help make our charts look better
library(janitor) #for data cleanup
library(lubridate) #for working with date fields

#we'll need these new packages for creating this markdown page to an HTMl page
library(rmarkdown)
library(knitr)
library(htmltools)
library(kableExtra)

Import data

The data file we are going to use is stored in a sub-directory to our working directory. The sub-directory is called “data.” Our data file is called “opiate_deaths.csv” and it is a comma-separated values text file.
We are going to use the readr package (which is part of Tidyverse) to import this file. Readr has a function called “read_csv” that is designed to import csv files.
This code below is the most basic we would need to import this data file. However, if we were to use this, readr would guess incorrectly and set our date fields to character. We also have a field showing the age of the person who died that we want to ensure comes in as a numeric field. If we don’t do this, we will have trouble sorting by dates or calculating an average age, for example.

deaths <- read_csv('./data/opiate_deaths.csv')

So we’re going to one of the optional arguments that come with the read_csv function that allows us to set the column types.
In the code below, it’s setting the default to character (“c”), but then it’s expressly telling R to set birthdate, deathdate and injury_date to a date format (col_date). Notice that col_date needs us to tell it how the date is displayed in our csv file. In this case, the dates show up as “5/5/2015”. This website nicely explains the various options if your date fields are stored differently.

At the bottom of our code we are using a pipe operator (%>%) to have it also do one more thing as part of the import. It’s going to run a function called clean_names() from the janitor package. This is a handy little tool that will convert all those uppercase field names to lowercase, ditch extra spaces in column names and things like that. R is case sensitive and doesn’t like spaces or weird symbols in its column names, so this function comes in handy quite often. I tack it on to the end of my imports every time. Notice, however, that when i referred to my column names earlier in the code, I put them the way they are in the csv file because clean_names() hasn’t done it’s work yet.

deaths <- read_csv('./data/opiate_deaths.csv',
                col_types=cols(.default="c",
                BIRTHDATE=col_date("%m/%d/%Y"),
                DEATHDATE=col_date("%m/%d/%Y"),
                INJURY_DATE=col_date("%m/%d/%Y"),
                AGEYEARS="i")) %>%
  clean_names()

Analysis with dplyr

The dplyr package is very similar to Structured Query Language (SQL) and it is one way to create the equivalent of Pivot Table where you summarize your data. See my R Cheat Sheet for more on the basic syntax.
Let’s start with simply selecting a few of the columns to display in our output.
Every query you run will start with the name of your dataframe (in this case we called it “deaths”) and then use a pipe operator (%>%) to “link” operations. In this example below, our second operation is to simply display a few of the columns from our data. To do this we use the “select” syntax from dplyr.

deaths %>%  select(firstname, lastname, deathdate)
## # A tibble: 3,807 x 3
##    firstname lastname deathdate 
##    <chr>     <chr>    <date>    
##  1 RONNIE    BLANKS   2005-01-01
##  2 MICHAEL   JOHNSON  2005-01-09
##  3 DONNIE    COOPER   2005-01-15
##  4 RANDY     SCHLAGEL 2005-01-18
##  5 TERRY     LUND     2005-01-20
##  6 ANTHONY   ELLIOT   2005-01-21
##  7 GERALD    MERCER   2005-01-23
##  8 MARK      ALBRECHT 2005-01-25
##  9 BRADLEY   GYSLAND  2005-01-26
## 10 DOROTHY   DAUM     2005-01-26
## # ... with 3,797 more rows

Select and filter

Now we’ll add one more line of code, using the filter command to limit the number of rows that are returned in our output. In this case, we are asking for all the women who are identified as African American.

deaths %>% 
   filter(gender=='F' , race=='African American') %>% 
   select(firstname, lastname, gender, deathdate)
## # A tibble: 105 x 4
##    firstname lastname gender deathdate 
##    <chr>     <chr>    <chr>  <date>    
##  1 MARIA     FULTZ    F      2005-04-30
##  2 CYNTHIA   WATSON   F      2005-05-10
##  3 KEYDAH    BARRY    F      2005-05-15
##  4 YOLANDA   ONCHWARI F      2005-09-02
##  5 DELOLA    REED     F      2006-04-19
##  6 NANCY     TERRY    F      2006-05-11
##  7 VICTORIA  GOLDEN   F      2006-07-10
##  8 PAMELA    SAIN     F      2007-05-07
##  9 KIMBERLY  COLLIER  F      2007-05-21
## 10 MAXINE    BATES    F      2007-07-31
## # ... with 95 more rows

Other ways to filter

Who died in January 2015? This tells it to filter for deaths after or equal to 1/1/2015 and before 2/1/2015. Note the format of the dates (yyy-mm-dd) because that’s how it’s stored in the data frame.

deaths %>%
  filter(deathdate>='2015-01-01', deathdate<'2015-02-01') %>% 
  select(lastname, firstname, gender, deathdate)
## # A tibble: 29 x 4
##    lastname    firstname gender deathdate 
##    <chr>       <chr>     <chr>  <date>    
##  1 ORCUTT      MARY      F      2015-01-02
##  2 KEEN        TOM       M      2015-01-02
##  3 EVANS       JOHN      M      2015-01-03
##  4 MARJAMA     GRACE     F      2015-01-07
##  5 JOHNSON     TERRY     M      2015-01-08
##  6 KLINKHAMMER KRISTY    F      2015-01-08
##  7 VICK        MICHAEL   M      2015-01-10
##  8 NEIBER      SHERYL    F      2015-01-10
##  9 DROBNICK    JENNIFER  F      2015-01-11
## 10 CHADDOCK    KATHERINE F      2015-01-15
## # ... with 19 more rows

This tells it to show people who were Chinese or Japanese. Note the pipe character (|) is used to indicate “or”. In the previous examples the comma was used to indicate “and”.

deaths %>%
  filter(race=='Chinese' | race=='Japanese') %>% 
  select(firstname, lastname, ageyears, race)
## # A tibble: 6 x 4
##   firstname   lastname   ageyears race    
##   <chr>       <chr>         <int> <chr>   
## 1 AKIKO       ROBIDEAUX        34 Japanese
## 2 CHRISTOPHER TAM              24 Chinese 
## 3 RYAN        BISSONETTE       20 Japanese
## 4 NICHOLAS    MCCLELLAN        23 Japanese
## 5 CHARLES     LIN              83 Chinese 
## 6 ALEXANDER   WLODKOWSKI       16 Chinese

There are a lot of other things to learn about filtering data, but we’ll save that for later.

Indenting

You probably noticed that in the last couple batches of code, each piece is on its own line. I’ve indented them on purpose to make it easier to read. The key is that the pipe (%>%) needs to be at the end of a line (never at the beginning). Once you hit enter after the pipe, RStudio will automatically know you want to indent.

deaths %>%
  filter(race=='Chinese' | race=='Japanese') %>% 
  select(firstname, lastname, ageyears, race)

Summarizing our data

Up until now, we’ve been picking out pieces from our dataframe – a few rows, a few columns. But usually we want to be able to summarize our data to find totals, percentages, averages and things like that.
The summarize (or “summarise” will also work) function in dplyr allows us to generate counts, sum up values in a column, calculate averages, etc.
The example below is simply counting the number of records in our data frame. Note the “n()” syntax – this is how you tell R to count records, and that I’m creating a column that I’m calling “numrecords.” The column will only appear in this temporary output. It will not be stored in a table. I’ve added some extra spaces to make the code easier to read.

deaths %>% summarize(numrecords = n() )
## # A tibble: 1 x 1
##   numrecords
##        <int>
## 1       3807

Pivot Table

Most often, though, we want to summarize by “groups” within our data; this is what we usually use Pivot Tables for in Excel. For example, how many people of each gender died from opioid-related deaths?
Note that now I’ve added another piece – group_by() – and am telling it to form the groups based on the values in the “gender” column in our data.

deaths %>%
  group_by(gender) %>% 
  summarise(numdeaths = n())
## # A tibble: 2 x 2
##   gender numdeaths
##   <chr>      <int>
## 1 F           1393
## 2 M           2414

The query above was simply counting the number of records where the gender field listed “M” (male) and how many listed “F” (female). But we can use summarize to do all kinds of calculations.
For example, what was the average age by gender?
We’ll use the mean() function to calculate an average on our “ageyears” column. Note I’m making a new column called “avg_age.”

deaths %>% 
  group_by(gender) %>% 
  summarize(avg_age = mean(ageyears))
## # A tibble: 2 x 2
##   gender avg_age
##   <chr>    <dbl>
## 1 F         44.3
## 2 M         39.6

We can also use summarize() to sum the values in a column. Unfortunately, this data only has one numeric field (ageyears) and it doesn’t make sense to sum those values.
But imagine you had a data frame listing each expenditure you made in your household last year and you want to get a total for each month. It would look something like this, assuming your data has a column indicating the month each expense occurred and a column called “amount_spent” identifying how much the item cost:

myhousehold_expenses %>% 
  group_by(month) %>% 
  summarize(monthly_total = sum(amount_spent))

Let’s return to our death data.

Group by two variables

Sometimes we want to count or sum things based on 2 values. For example, we want to find out how many of the people who died were both women and African American.

deaths %>% 
  group_by(gender, race) %>% 
  summarise(numdeaths = n())
## # A tibble: 27 x 3
## # Groups:   gender [2]
##    gender race                      numdeaths
##    <chr>  <chr>                         <int>
##  1 F      African American                105
##  2 F      American Indian                 131
##  3 F      American Indian-multirace         4
##  4 F      Asian-multirace                   1
##  5 F      Black-multirace                   6
##  6 F      Japanese                          1
##  7 F      Other Asian                       6
##  8 F      Other Race                        5
##  9 F      Unknown                           7
## 10 F      White                          1114
## # ... with 17 more rows

Sort our results

Let’s add one more piece from dplyr – arrange() – so we can see our results in a prescribed order. For example, that last query we ran was very long. It might be nice to see the biggest group at the top. The default behavior for arrange() is ascending order, so in order to get the biggest on top (descending order), we need to add “desc” as part of the code.

deaths %>% 
  group_by(gender, race) %>% 
  summarise(numdeaths = n()) %>% 
  arrange(desc(numdeaths))
## # A tibble: 27 x 3
## # Groups:   gender [2]
##    gender race             numdeaths
##    <chr>  <chr>                <int>
##  1 M      White                 1982
##  2 F      White                 1114
##  3 M      African American       229
##  4 M      American Indian        137
##  5 F      American Indian        131
##  6 F      African American       105
##  7 M      Other Race              19
##  8 F      White-multirace         13
##  9 M      White-multirace         13
## 10 F      Unknown                  7
## # ... with 17 more rows

Filter summarized results

There are times you want to limit the results in your output in some way or another. You can do that by adding a filter command at the end of your query. In this example, we’re telling it to only give us results for groups where there were more than 100 deaths.

deaths %>% 
  group_by(gender, race) %>% 
  summarise(numdeaths = n()) %>% 
  arrange(desc(numdeaths)) %>% 
  filter(numdeaths>100)
## # A tibble: 6 x 3
## # Groups:   gender [2]
##   gender race             numdeaths
##   <chr>  <chr>                <int>
## 1 M      White                 1982
## 2 F      White                 1114
## 3 M      African American       229
## 4 M      American Indian        137
## 5 F      American Indian        131
## 6 F      African American       105

Mutate

This is one of my favorite features of dplyr. Mutate() allows you to add a new column, either to a temporary output or add it to a data frame. We’ll start by adding a new column to our output. Remember the query about how many were men and how many were women? Wouldn’t it be great to show percentages?
Notice that mutate() comes after summarize(). This is important because mutate is going to take the column we created in summarize() – numdeaths – and generate a percentage from it. Because we’re grouping by gender, this is taking the numdeaths for the gender and dividing it by the total of the numdeaths column. And we’re creating a new column called “pct”.

deaths %>% 
  group_by(gender) %>% 
  summarize(numdeaths = n()) %>% 
  mutate(pct = numdeaths/sum(numdeaths)) %>% 
  arrange(desc(pct))
## # A tibble: 2 x 3
##   gender numdeaths   pct
##   <chr>      <int> <dbl>
## 1 M           2414 0.634
## 2 F           1393 0.366

Put it all together

Let’s put a bunch of things together in one query. We’ll filter to only the women, then group them by race to find out how many women of each race died, what the average age was in each group and the percentage of the total women that each group accounts for.
We’ll add a couple more things to fancy this up. First, you’ll see that we’re using the round function (from Base R) to round our percentage to one decimal point. We’re also multiplying it by 100 to make it a whole number. Note: the location of the parentheses on that calculation is really important and easy to get wrong.

deaths %>% 
  filter(gender=='F') %>% 
  group_by(race) %>% 
  summarise(numdeaths = n(), avgage = round(mean(ageyears),1)) %>% 
  mutate(pct = round((numdeaths/sum(numdeaths)*100),1)) %>% 
  arrange(desc(pct))
## # A tibble: 11 x 4
##    race                      numdeaths avgage   pct
##    <chr>                         <int>  <dbl> <dbl>
##  1 White                          1114   45.4  80  
##  2 American Indian                 131   38.1   9.4
##  3 African American                105   42.9   7.5
##  4 White-multirace                  13   36.5   0.9
##  5 Unknown                           7   35.3   0.5
##  6 Black-multirace                   6   28.3   0.4
##  7 Other Asian                       6   55     0.4
##  8 Other Race                        5   37.6   0.4
##  9 American Indian-multirace         4   39.5   0.3
## 10 Asian-multirace                   1   40     0.1
## 11 Japanese                          1   34     0.1

Add a new column

One of the questions we might have for this data is a simple one: How many people died each year? Our data, however, has the date of death, but not a separate field identifying the year. So let’s create one. We’ll use a package called lubridate that has a year() function, which will strip the year out of the date. Note: The date field needs to be stored as a true date field (not character). Remember how we fixed that on the import? This is why.

#this is going to overwrite our existing data frame, pull everything from that existing data frame and then use mutate() -- from dplyr -- to add a new column called "deathyr"

deaths <-  deaths %>% mutate(deathyr = year(deathdate) )

Let’s look at what that shows us:

deaths %>%
  group_by(deathyr) %>%
  summarize(numdeaths = n())
## # A tibble: 13 x 2
##    deathyr numdeaths
##      <dbl>     <int>
##  1    2005       168
##  2    2006       187
##  3    2007       192
##  4    2008       237
##  5    2009       277
##  6    2010       237
##  7    2011       301
##  8    2012       310
##  9    2013       325
## 10    2014       335
## 11    2015       344
## 12    2016       401
## 13    2017       493

Cleaning dirty data

Let’s start by looking at the hispanicethnicity field

# let's look at this field to see where we've got a problem
deaths %>% group_by(hispanicethnicity) %>% summarise(count=n())
## # A tibble: 8 x 2
##   hispanicethnicity count
##   <chr>             <int>
## 1 hispanic              1
## 2 HISPANIC             75
## 3 non-hispanic          1
## 4 NON-HISPANIC          3
## 5 NOT-HISPANIC          2
## 6 not hispanic          1
## 7 NOT HISPANIC       3718
## 8 UNKNOWN               6

Notice the wide variety of ways that Hispanic and Non-Hispanic are listed? In order to get a good count, we need to make them consistent. This is called standardizing data.

How to recode

We are going to create a new column – called hispanic_new – and use a function called case_when() to fix the problematic values.

#case_when is from dplyr: https://dplyr.tidyverse.org/reference/case_when.html

deaths <-  deaths %>%
  mutate(hispanic_new = 
           case_when(hispanicethnicity=='Non-Hispanic' | 
hispanicethnicity=='NOT HISPANIC' | hispanicethnicity=='NOT-HISPANIC' | 
  hispanicethnicity=='non-hispanic' | hispanicethnicity=='NON-HISPANIC'~'NOT HISPANIC',
TRUE~toupper(hispanicethnicity)))

# the base R function called "toupper()" converts values to uppercase


#let's see our new column

deaths %>% 
  group_by(hispanic_new) %>% 
  summarise(numdeaths = n())
## # A tibble: 3 x 2
##   hispanic_new numdeaths
##   <chr>            <int>
## 1 HISPANIC            76
## 2 NOT HISPANIC      3725
## 3 UNKNOWN              6

Export data

If you want to export some or all of your data, you can use write.csv
It will spit out a file to your working directory

#Let's spit out a subset of data -- in this case, the number of deaths by year

deaths_by_year <-  deaths %>%
  group_by(deathyr) %>%
  summarize(numdeaths = n())

#this will send a csv to your working directory
#if you don't include row.names it will add a column, numbering each row 1 through x.
write.csv(deaths_by_year, 'deaths_by_year.csv', row.names=FALSE)



#Another way to export a sub-set of your data is to include a filter in your write.csv code.
#In this example, I'm going to export all the columns from the main table, but only for the women who died
write.csv(deaths %>% filter(gender=='F'), 'women_deaths.csv', row.names=FALSE)


#You can send the file to a sub-directory like this (this assumes you have a sub-directory called "output")
#write.csv(deaths %>% filter(gender=='F', './output/women_deaths.csv', row.names=FALSE))

Knit to HTML

About knitr
Push the Knit button at the top of the page and it will automatically create an HTML file, as long as there are no errors in your code.
The code at the top of the page that says “echo=TRUE” means that the HTML page will include your code chunks and the results. Try running this page one time with echo=TRUE, open the HTML page (in your working directory) and see what it looks like. Then chase echo=FALSE and knit the page again and see how it has changed.
Also notice that the results of our queries don’t look very pretty. It is possible to make prettier tables using kable() and kableExtra(), which come with the kntir package.