5 Manipulating data

This section is an introduction to manipulating datasets using the dplyr package. As outlined in the previous section, dplyr and ggplot2 are part of the tidyverse, which aims to provide a user-friendly framework for data science (Grolemund and Wickham 2016).

Experience of teaching R over the past few years suggests that many people find it easier to get going with data-driven research if they learn the ‘tidy’ workflow presented in this section. However, if you do not like this style of R code, or you are simply curious, we encourage you to try alternative approaches for achieving similar results using base R (R Core Team 2020b)15, the data.table R package (Dowle and Srinivasan 2019) or other languages such as Python or Julia. If you just want to get going with processing data, the tidyverse is a solid and popular starting point.

Before diving into the tidyverse, it is worth recapping where we have got to, as we have covered a lot of ground. Section 2 introduced R’s basic syntax; Section 3 showed how to use the Source Editor and other features of RStudio to support data science; and Section 4 introduced the concept and practicalities of R packages, with reference to stats19, ggplot2 and dplyr.

In this section, we will start with a blank slate. In Section 2 we learned that in R having a ‘clear desk’ means an empty global environment. This can be achieved by running the following command, which removes the list() of all objects returned by the function ls():

rm(list = ls())

5.1 tibbles

Although the data processing techniques in R are capable of handling large datasets, such as the crashes_2019 object that we created in the previous section, representing 100k+ casualties, it makes sense to start small. Let’s start by re-creating the crashes dataset from Section 2, but this time using the tidyverse tibble() function. This is the tidyverse equivalent of base R’s data.frame. tibble objects can be created, after loading the tidyverse, as follows:

library(tidyverse)
crashes = tibble(
  casualty_type = c("pedestrian", "cyclist", "cat"),
  casualty_age = seq(from = 20, to = 60, by = 20),
  vehicle_type = c("car", "bus", "tank"),
  dark = c(TRUE, FALSE, TRUE)
)

In the previous code chunk, we passed four vector objects as named arguments to the tibble function, resulting in columns such as casualty_type. A tibble is just a fancy way of representing data.frame objects, preferred by tidyverse users and optimised for data science. It has a few sensible defaults and advantages compared with the data.frame, one of which can be seen by printing a tibble:

class(crashes)
## [1] "tbl_df"     "tbl"        "data.frame"
crashes
## # A tibble: 3 x 4
##   casualty_type casualty_age vehicle_type dark 
##   <chr>                <dbl> <chr>        <lgl>
## 1 pedestrian              20 car          TRUE 
## 2 cyclist                 40 bus          FALSE
## 3 cat                     60 tank         TRUE

Note the <chr>, <dbl> and <lgl> text below each column, which provides a quick indication of the class of each variable. This information is not shown when printing a data.frame.
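To see the difference for yourself, you could build the same dataset with base R’s data.frame() and print it. The following is a minimal sketch for comparison only; the object name crashes_df is arbitrary and is not used elsewhere in this section:

crashes_df = data.frame(
  casualty_type = c("pedestrian", "cyclist", "cat"),
  casualty_age = seq(from = 20, to = 60, by = 20),
  vehicle_type = c("car", "bus", "tank"),
  dark = c(TRUE, FALSE, TRUE)
)
crashes_df # printed without the <chr>/<dbl>/<lgl> type indicators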

5.2 filter() and select() rows and columns

In the previous section, we briefly introduced the package dplyr, which provides an alternative to base R for manipulating objects. dplyr provides different, and some would argue simpler, approaches for subsetting rows and columns than base R.
dplyr operations for subsetting rows (with the function filter()) and columns (with the function select()) are demonstrated below. Here we can also see the pipe operator %>%, which takes the dataset on its left and passes it as the first argument to the function on its right.

crashes %>% filter(casualty_age > 50) # filters rows
## # A tibble: 1 x 4
##   casualty_type casualty_age vehicle_type dark 
##   <chr>                <dbl> <chr>        <lgl>
## 1 cat                     60 tank         TRUE
crashes %>% select(casualty_type) # select just one column
## # A tibble: 3 x 1
##   casualty_type
##   <chr>        
## 1 pedestrian   
## 2 cyclist      
## 3 cat

It should be clear what happened: filter() returns only rows that match the criteria in the function call, only observations with a casualty_age greater than 50 in this case. Likewise, select() returns data objects that include only columns named inside the function call, casualty_type in this case.
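If you prefer not to use the pipe, the same results can be obtained by calling the functions directly, or with base R subsetting. The following lines are a sketch for comparison only (results not shown):

filter(crashes, casualty_age > 50)   # dplyr without the pipe
crashes[crashes$casualty_age > 50, ] # base R square-bracket subsetting
select(crashes, casualty_type)       # dplyr without the pipe
crashes["casualty_type"]             # base R subsetting, keeping the data frame structure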

To gain a greater understanding of the functions, type and run the following commands, which also illustrate how the %>% can be used more than once to manipulate data (more on this soon):

crashes_darkness = crashes %>% filter(dark)
crashes_a = crashes %>% select(contains("a"))
crashes_darkness_a = crashes %>% 
  filter(dark) %>% 
  select(contains("a"))

Can you guess what the dimensions of the resulting objects will be? Write down your guesses for the number of rows and columns in each of the new objects, from crashes_darkness to crashes_darkness_a, before running the following commands to find out. This also demonstrates the handy function dim(), short for dimension (results not shown):16

dim(crashes)
dim(crashes_darkness)
?contains # get help on contains() to help guess the output of the next line
dim(crashes_a)
dim(crashes_darkness_a)
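As an aside, the related base R functions nrow() and ncol() return the two elements of dim() separately, which can be handy in longer scripts. A small sketch using only the original crashes object, so as not to give away the answers above:

nrow(crashes) # number of rows
ncol(crashes) # number of columns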

Look at the help pages associated with filter(), select() and the related function slice(), as follows, and try running the examples at the bottom of each help page to gain a greater understanding (note that you can also use the package::function notation to get help on functions):

?dplyr::filter
?dplyr::select
?dplyr::slice
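Based on its help page, slice() subsets rows by position rather than by a logical condition. A minimal sketch of how it can be used on the crashes object (results not shown):

crashes %>% slice(1)   # return only the first row
crashes %>% slice(2:3) # return the second and third rows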

5.3 Ordering and selecting the ‘top n’

Other useful pipe-friendly functions are arrange() and top_n(). arrange() can be used to sort data; optional arguments to arrange() define the order in which the data are sorted. top_n() simply selects the top ‘n’ rows of your data frame. We can use these functions to arrange datasets and take the top ‘n’ values, as follows:

crashes %>% 
  arrange(vehicle_type)
## # A tibble: 3 x 4
##   casualty_type casualty_age vehicle_type dark 
##   <chr>                <dbl> <chr>        <lgl>
## 1 cyclist                 40 bus          FALSE
## 2 pedestrian              20 car          TRUE 
## 3 cat                     60 tank         TRUE
crashes %>% 
  top_n(n = 1, wt = casualty_age)
## # A tibble: 1 x 4
##   casualty_type casualty_age vehicle_type dark 
##   <chr>                <dbl> <chr>        <lgl>
## 1 cat                     60 tank         TRUE
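Note that in more recent versions of dplyr, top_n() has been superseded by slice_max() (and slice_min()). If your installed version provides it, the previous result can also be obtained as follows (a sketch, output not shown):

crashes %>% slice_max(casualty_age, n = 1) # row(s) with the highest casualty_age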

5.4 Summarise

A powerful two-function combination is group_by() and summarise(). Used together, they can provide grouped summaries of datasets. In the example below, we find the mean age of casualties in dark and light conditions.

crashes %>%
  group_by(dark) %>% 
  summarise(mean_age = mean(casualty_age))
## # A tibble: 2 x 2
##   dark  mean_age
## * <lgl>    <dbl>
## 1 FALSE       40
## 2 TRUE        40

The example above shows a powerful feature of these pipelines. Many operations can be ‘chained’ together, whilst keeping readability, with subsequent commands stacked below earlier operations. The combination of group_by() and summarise() can be very useful in preparing data for visualisation with a ggplot2 function. Another useful feature of the tidyverse, from a user perspective, is the autocompletion of column names mid-pipe. If you have not noticed this already, you can test it by typing the following, putting your cursor just before the ) and pressing Tab:

crashes %>% select(ca) # press Tab when your cursor is just after the a

You should see casualty_age and casualty_type pop up as options that can be selected by pressing Up and Down. This may not seem like much, but when analysing large datasets with dozens of variables, it can be a godsend.
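To illustrate the link with visualisation mentioned above, a grouped summary can be piped straight into ggplot2 (loaded as part of the tidyverse). The following is a minimal sketch; the resulting plot is not shown here:

crashes %>%
  group_by(dark) %>%
  summarise(mean_age = mean(casualty_age)) %>%
  ggplot(aes(x = dark, y = mean_age)) +
  geom_col() # bar chart of mean casualty age in dark and light conditions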

Rather than providing a comprehensive introduction to the tidyverse suite of packages, this section should have offered enough to get started with using it for road safety data analysis. For further information, check out up-to-date online courses from respected organisations such as Data Carpentry, and free online books such as R for Data Science (Grolemund and Wickham 2016).

5.5 Tidyverse exercises

  1. Use dplyr to filter rows in which casualty_age is less than 18, and then 28.
  2. Use the arrange function to sort the crashes object in descending order of age (Hint: see the ?arrange help page).
  3. Read the help page of dplyr::mutate(). What does the function do?
  4. Use the mutate function to create a new variable, birth_year, in the crashes data.frame which is defined as the current year minus their age.
  5. Bonus: Use the %>% operator to filter the output from the previous exercise so that only observations with birth_year after 1969 are returned.

  15. Run the command help.start() to see resources introducing base R; see in particular Chapter 6 on lists and data frames in An Introduction to R for an introduction to data manipulation with base R.↩︎

  16. Note that the number of rows is reported before the number of columns. This is a feature of R: rows are also specified first when subsetting using the square brackets in commands such as crashes[1, 2:3].↩︎