Here are a few things that are helpful to know when getting started
In Tidyverse, the dplyr package is the primary tool you’ll use for data analysis. The syntax is very similar to Structured Query Language (SQL).
Anywhere it says “df” is where you would insert your dataframe or tibble name that holds your data. This is case-sensitive.
Anywhere it says “column_name” is where you would insert the name of a column in your dataframe. This is case-sensitve.
Anywhere it says “df_new”, the code is making a new column on the fly and you can name it whatever you want.
Select columns to view in output
df %>% select(column_name, columname)
Filter which rows are shown in output
Note the double equal sign
df %>% filter(column_name== 'value')
Summarize your data by groups (like a Pivot Table)
tablename %>% group_by(column_name)
Do some math as part of that summarizing
Count the records:
df %>%
group_by(column_name) %>%
summarize(new_column = n() )
Sum the values in a column:
df %>%
group_by(column_name) %>%
summarize(new_column = sum(column_name))
Average values in a column:
df %>%
group_by(column_name) %>%
summarize(new_column = mean(column_name) )
Note: for median you will need the stats package.
Sort your results by a column
This will sort in ascending order
df %>%
group_by(column_name) %>%
summarize(new_column = n() ) %>%
arrange(column_name)
Readr package in Tidyverse comes with 5 “parsers” for importing different text file types.
read_csv() – for comma-delimited files
read_tsv() – for tab-delimited files
read_fwf() – for fixed-width files
read_log() for web log files
(Note that BaseR has a function called “read.csv” that functions very differently)
If you need to import an Excel file, you need the readxl package from Tidyverse.
There are other packages for other file types, including JSON or SPSS. And there are also ways to get data via APIs, but this cheat sheet does not cover these.
df <- read_csv("mydatafile.csv")
If your data file is in a sub-directory of your working directory, you’ll need to specify the path to that directory. I this example my sub-directory is called “data”. Note the period before the slash to indicate to go down one directory from the working directory:
df <- read_csv("./data/mydatafile.csv")
Read_csv will guess the column formats (character, numeric, double, date, etc) based on the first 1,000 rows in the file. You can tell it took at more rows, or you can specify column types as part of the import code.
df <- read_csv("./data/mydatafile.csv") ,
col_types= cols(.default = col_character(),
column_name1 = col_double(),
column_name2 = col_date(format= "%m/%d/Y") ) )
The example above tells it to default to character for all fields, except column_name1 and column_name2 which are set to specific types. For date and datetime, you need to tell R what the format is of the dates coming in. In the example above, I told it that my data has dates that appear like “5/10/2018” (m/d/yyyy) More info on this.
Sometimes you get an Excel file where the data starts several rows down and there’s a bunch of crap either above or below the data that you don’t want to bring in. The readxl package lets you specify exactly what to pull. Note: the sheet and range arguments in the code below are optional.
df <- read_excel("mydatafile.xlsx", sheet="Sheet1", range="B14:E312")
Joining is how you put two data frames together side-by-side, matching on one or more idnumbers or other columns that match in the two data frames. Your new data frame will have more columns, not necessarily more rows, depending on the join type. More info here.
##Join types:
Inner_join – this will return only the records that match between the two tables
Left_join – this will return all the records from the first table listed, but only the ones that match from the second table listed.
Right_join – the opposite of a left_join. All records from second table.
Anti_join – this will return all the records from the first table that are NOT IN the second table.
Full_join – this will return all the records from both tables, regardless if they match
Note: column_name1 and column_name2 are the columns from your two data frames that match each other. Sometimes this is an ID number, sometimes it’s a name. It is possible to join two tables on more than 1 matching column, but that is not shown here.
df_new <- inner_join(df1, df2, by=c("column_name1"= "column_name2") )
This basic syntax will put the two data frames together and will put all the columns (from both tables) into a new data frame. It will not repeat the “joining” column if they have the same name. If there are other columns with the same names, your new columns will be named with a “.x” and “.y” attached to the end of the name to identify which table they came from. X refers to the first table listed in the join. Y refers to the second.
If you want to only bring a few columns from the second table (and avoid having duplicate columns), you can do that as part of the join. Note that we include a pipe and select command before the comma and you MUST include any columns that are being used in the join.
df_new <- inner_join(df1, df2 %>% select(idnumber, col1, col2, col3) , by=c("column_name1"= "column_name2") )
Typically you would use this when you have two data files – hopefully with all the same columns and column names – that you want in one data file. In other words, you’re adding the rows from one to the bottom of the other data frame.
Tidyverse has a function called “bind_rows” that will do this, but thankfully it doesn’t insist that the number of columns (and their names) be identical between the two data frames. Base R has a function called “rbind” that does require these to be exact. So bind_rows tends to be more flexible.
Here we’re going to make a new data frame by binding two existing data frames together:
df_new <- bind_rows(df1, df2)
Bind_rows will line up the columns in df1 with those in df2 that have the same name. If you have a field in df1 called “state” and a column in df2 called “statename”, bind_rows will put those in separate columns in your new data frame. And if you have a column that exists in one table but not the other, that will still show up as a column in your new data frame.
#Renaming columns
There are a couple different ways to rename a column. This syntax is what you’d likely use when you are importing or building a new data frame.
For example, this would create a data frame with just 4 columns selected from the old one and it will rename col1 in the process:
df_new <- df %>%
select(col1, col2, col3, col4)%>%
rename(newname = col1)
If you just want to keep your data frame with all the columns, but rename one or more fields you can do this:
df <- df %>%
rename(newname1 = col1, newname2 = col2)
You can also rename on the fly in your analysis. In this example, the field is currently called “gender” but I want to change it to “sex” in my output (it won’t be saved to the dataframe).
df %>% select(sex=gender, firstname, lastname)