5 Manipulating data
This section is an introduction to manipulating datasets using the dplyr
package.
As outlined in the previous section, dplyr
and ggplot2
are part of the tidyverse
, which aims to provide a user-friendly framework for data science (Grolemund and Wickham 2016).
Experience of teaching R over the past few years suggests that many people find it easier to get going with data driven research if they learn the ‘tidy’ workflow presented in this section.
However, if you do not like this style of R code or you are simply curious, we encourage you to try alternative approaches for achieving the similar results using base R (R Core Team 2020b)15
, the data.table
R package (Dowle and Srinivasan 2019) or other languages such as Python or Julia.
If you just want to get going with processing data, the tidyverse
is a solid and popular starting point.
Before diving into the tidyverse
, it is worth re-capping where we have got to so far as we have covered a lot of ground.
Section 2 introduced R’s basic syntax; Section 3 showed how to use the Source Editor and other features of RStudio to support data science; and Section 4 introduced the concept and practicalities of R packages, with reference to stats19
, ggplot2
and dplyr
.
In this section, we will start with a blank slate.
In Section 2 we learned that in R having a ‘clear desk’ means an empty global environment.
This can be achieved by running the following command, which removes the list()
of all objects returned by the function ls()
:
5.1 tibbles
Although the data processing techniques in R are capable of handling large datasets, such as the crashes_2019
object that we created in the previous section, representing 100k+ casualties, it makes sense to start small.
Let’s start by re-creating the crashes
dataset from Section 2, but this time using the tidyverse
tibble()
function. This is the tidyverse
equivalent of base R’s data.frame
.
tibble
objects can be created, after loading the tidyverse
, as follows:
library(tidyverse)
crashes = tibble(
casualty_type = c("pedestrian", "cyclist", "cat"),
casualty_age = seq(from = 20, to = 60, by = 20),
vehicle_type = c("car", "bus", "tank"),
dark = c(TRUE, FALSE, TRUE)
)
In the previous code chunk, we passed four vector objects as named arguments to the tibble
function, resulting in columns such as casualty_type
.
A tibble
is just a fancy way of representing data.frame
objects, preferred by tidyverse
users and optimised for data science.
It has a few sensible defaults and advantages compared with the data.frame
, one of which can be seen by printing a tibble
:
## [1] "tbl_df" "tbl" "data.frame"
## # A tibble: 3 x 4
## casualty_type casualty_age vehicle_type dark
## <chr> <dbl> <chr> <lgl>
## 1 pedestrian 20 car TRUE
## 2 cyclist 40 bus FALSE
## 3 cat 60 tank TRUE
Note the <chr>
, <dbl>
or <lgl>
text below each column, providing a quick indication of the class of each variable - this is not provided when using data.frame
.
5.2 filter() and select() rows and columns
In the previous section, we briefly introduced the package dplyr
, which provides an alternative to base R for manipulating objects. dplyr
provides different, and some would argue simpler, approaches for subsetting rows and columns than base R.
dplyr
operations for subsetting rows (with the function filter()
) and columns (with the function select()
) are demonstrated below. Here we can also see the use of the pipe operator %>%
to take the dataset and apply the function to that dataset.
## # A tibble: 1 x 4
## casualty_type casualty_age vehicle_type dark
## <chr> <dbl> <chr> <lgl>
## 1 cat 60 tank TRUE
## # A tibble: 3 x 1
## casualty_type
## <chr>
## 1 pedestrian
## 2 cyclist
## 3 cat
It should be clear what happened: filter()
returns only rows that match the criteria in the function call, only observations with a casualty_age
greater than 50 in this case.
Likewise, select()
returns data objects that include only columns named inside the function call, casualty_type
in this case.
To gain a greater understanding of the functions, type and run the following commands, which also illustrate how the %>%
can be used more than once to manipulate data (more on this soon):
crashes_darkness = crashes %>% filter(dark)
crashes_a = crashes %>% select(contains("a"))
crashes_darkness_a = crashes %>%
filter(dark) %>%
select(contains("a"))
Can you guess what the dimensions of the resulting objects will be?
Write down your guesses for the number of rows and number of columns that the new objects, crashes_darkness
to crashes_darkness_a
, have before running the following commands to find out. This also demonstrates the handy function dim()
, short for dimension (results not shown):16
dim(crashes)
dim(crashes_darkness)
?contains # get help on contains() to help guess the output of the next line
dim(crashes_a)
dim(crashes_darkness_a)
Look at the help pages associated with filter()
, select()
and the related function slice()
as follows and try running the examples that you will find at the bottom of the help pages for each to gain a greater understanding (note you can use the package::function
notation to get help on functions also):
5.3 Ordering and selecting the ‘top n’
Other useful pipe-friendly functions are arrange()
and top_n()
. arrange()
can be used to sort data. Within. the arrage()
function, optional arguments can be used to define the order in which it is sorted. top_n()
simply selects the top ‘n’ number of rows in your data frame.
We can use these functions to arrange datasets and take the top most ‘n’ values, as follows:
## # A tibble: 3 x 4
## casualty_type casualty_age vehicle_type dark
## <chr> <dbl> <chr> <lgl>
## 1 cyclist 40 bus FALSE
## 2 pedestrian 20 car TRUE
## 3 cat 60 tank TRUE
## # A tibble: 1 x 4
## casualty_type casualty_age vehicle_type dark
## <chr> <dbl> <chr> <lgl>
## 1 cat 60 tank TRUE
5.4 Summarise
A powerful two-function combination is group_by()
and summarise()
.
Used together, they can provide grouped summaries of datasets.
In the example below, we find the mean age of casualties in dark and light conditions.
## # A tibble: 2 x 2
## dark mean_age
## * <lgl> <dbl>
## 1 FALSE 40
## 2 TRUE 40
The example above shows a powerful feature of these pipelines. Many operations can be ‘chained’ together, whilst keeping readability with subsequent commands stacked below earlier operations. The combination of group_by()
and summarise()
can be very useful in preparing data for visualisation with a ggplot2
function.
Another useful feature of the tidyverse
from a user perspective is the autocompletion of column names mid pipe.
If you have not noticed this already, you can test it by typing the following, putting your cursor just before the )
and pressing Tab
:
You should see casualty_age
and casualty_type
pop up as options that can be selected by pressing Up
and Down
.
This may not seem like much, but when analysing large datasets with dozens of variables, it can be a godsend.
Rather than providing a comprehensive introduction to the tidyverse
suite of packages, this section should have offered enough to get started with using it for road safety data analysis.
For further information, check out up-to-date online courses from respected organisations like Data Carpentry and the free online books such as R for Data Science (Grolemund and Wickham 2016).
5.5 Tidyverse exercises
- Use
dplyr
to filter rows in whichcasualty_age
is less than 18, and then 28. - Use the
arrange
function to sort thecrashes
object in descending order of age (Hint: see the?arrange
help page). - Read the help page of
dplyr::mutate()
. What does the function do? - Use the mutate function to create a new variable,
birth_year
, in thecrashes
data.frame which is defined as the current year minus their age. - Bonus: Use the
%>%
operator to filter the output from the previous exercise so that only observations withbirth_year
after 1969 are returned.
Run the command
help.start()
to see a resources introducing base R, and Chapter 6 on lists and data frames in An Introduction to R in particular for an introduction to data manipulation with base R.↩︎Note that the number of rows is reported before the number of columns. This is a feature of R: rows are also specificiefied first when subsetting using the square brackets in commands such as
crashes[1, 2:3]
.↩︎