7 Spatial data
From displaying simple point data to examining collision density along routes or between areas, the geographic property of STATS19 data is one of its most useful attributes. Mapping is a hugely useful and powerful aspect of R and has many applications in road safety, both in understanding geographic trends and presenting insight to colleagues. This aspect of R is covered in detail in the book Geocomputation With R (Lovelace, Nowosad, and Muenchow 2019). By mapping collision data in R, you can add layers containing other geographic datasets to further understand the reasons for certain trends. This can lead to new opportunities for intervention and collaboration with other parties who may have mutually compatible solutions for reaching their goals. We will use the following packages in this section:
library(sf) # load the sf package for working with spatial data
library(tidyverse) # load the tidyverse as before
7.1 sf objects
All road crashes happen somewhere and, in the UK at least, all collisions recorded by the police are given geographic coordinates. These can help in prioritising interventions to save lives by focusing on ‘crash hotspots.’
R has strong geographic data capabilities with the sf
package, providing a generic class for spatial vector data. Points, lines and polygons are represented in sf
as objects in a special ‘geometry column’, typically called ‘geom’ or ‘geometry’, extending the data frame class we’ve already seen in crashes
, created in Section 5.1 (repeated here to consolidate data frame creation):
crashes = tibble(
casualty_type = c("pedestrian", "cyclist", "cat"),
casualty_age = seq(from = 20, to = 60, by = 20),
vehicle_type = c("car", "bus", "tank"),
dark = c(TRUE, FALSE, TRUE)
)
Create an sf
data frame called crashes_sf
that expands the crashes
data frame to include a geometry column based on the crashes
longitude and latitude data as follows:
crashes_sf = crashes # create copy of crashes dataset
crashes_sf$longitude = c(-1.3, -1.2, -1.1)
crashes_sf$latitude = c(50.7, 50.7, 50.68)
crashes_sf = st_as_sf(crashes_sf, coords = c("longitude", "latitude"), crs = 4326) # st_as_sf converts longitude and latitude coordinates into spatial objects using a specified Coordinate Reference System (see 7.6)
# plot(crashes_sf[1:4]) # basic plot
# mapview::mapview(crashes_sf) # for interactive map
Exercises:
- Plot only the geometry column of
crashes_sf
(Hint: the solution may contain$geometry
). If the result is like that in Figure 7.1, congratulations, it worked! - Plot
crashes_sf
, only showing the age variable. - Plot the 2nd and 3rd crashes, showing which happened in the dark.
- Bonus: How far apart are the points? (Hint:
sf
functions begin withst_
) - Bonus: Near which settlement did the tank runover the cat?
7.2 Reading and writing spatial data
You can read and write spatial data with read_sf()
and write_sf()
, as shown below (see ?read_sf
):
write_sf(zones, "zones.geojson") # saves the spatial dataset called zones as geojson file type
write_sf(zones, "zmapinfo", driver = "MapInfo file") # saves the dataset as a MapInfo file
read_sf("zmapinfo") # reads-in mapinfo file
See Chapter 6 of Geocomputation with R for further information (Lovelace, Nowosad, and Muenchow 2019).
7.3 sf polygons
sf
objects can also represent administrative zones.
This is illustrated below with reference to zones
, a spatial object representing the Isle of Wight, that we will download using the pct
package (Note: the [1:9]
appended to the function selects only the first 9 columns).
Exercises:
- What is the class of the
zones
object? - What are its column names?
- Print its first 2 rows and columns 6:8 (the result is below).
## Simple feature collection with 2 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -1.301131 ymin: 50.69052 xmax: -1.28837 ymax: 50.70547
## geographic CRS: WGS 84
## # A tibble: 2 x 6
## geo_code all bicycle foot car_driver geometry
## <chr> <dbl> <dbl> <dbl> <dbl> <MULTIPOLYGON [°]>
## 1 E01017326 698 23 285 286 (((-1.289993 50.69766, -1.290177 50.…
## 2 E01017327 720 25 225 374 (((-1.295712 50.69383, -1.29873 50.6…
7.4 Spatial subsetting and sf plotting
Like index and value subsetting, spatial subsetting can be done with the []
notation.
We can identify the crashes (crashes_sf
) that occur in the Isle of Wight (zones
) by subsetting as follows (i.e. subset zones
by whether it contains data in crashes_sf
):
To plot a new layer on top of an existing sf
plot, use the add = TRUE
argument, e.g. as follows:
plot(zones$geometry) # plot just the geometry of one layer
plot(zones_containing_crashes$geometry, col = "grey", add = TRUE)
Remember to plot only the geometry
column of objects to avoid multiple maps.
Colours can be set with the col
argument.
Exercises:
- Plot the geometry of the zones, with the zones containing crashes overlaid on top in red (see Figure 7.2).
- Plot the zone containing the 2nd crash in blue (see Figure 7.2).
- Bonus: Plot all zones that intersect with a zone containing crashes, with the actual crash points plotted in black (see Figure 7.2).
7.5 Geographic joins
Geographic joins involve assigning values from one object to a new column in another, based on the geographic relationship between them.
With sf
objects, the data from the crashes_sf
dataset is joined onto the ‘target’ zones
dataset, to create a new object called zones_joined
:
The above code takes the geo_code
column data from zones
, matches it to the geometry
column in crashes_sf
and then joins it to the crashes that have occurred in those geo_codes. The matched, joined geo_code
is a new column in the zone_joined
dataset. We now know the administrative geo_code in which each crash occured.
Exercises:
- Plot the
casualty_age
variable of the newzones_joined
object (see Figure 7.3, to verify the result). - How many zones are returned in the previous command?
- Select only the
geo_code
column from thezones
and thedark
column fromcrashes_sf
and use theleft = FALSE
argument to return only zones in which crashes occurred. Plot the result. (Hint: it should look like the map shown in Figure 7.3)
See Chapter 4 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019) for further information on geographic joins.
7.6 Coordinate Reference Systems
A Coordinate Reference Systems (CRS) is used for plotting data on maps. There are many systems in use but they can generally be classified into two groups; ‘projected’ and ‘geographic’. A projected system, such as Eastings/Northings, plots locations onto a flat 2D projection of the Earth’s surface. A geographic system, such as Longitude/Latitude, refers to locations on the 3D surface of the globe. Distance and direction calculations work differently between geographic and projected CRSs, so it is often necessary to convert from one to another. Fortunately, R makes this very easy, and every CRS has its own unique reference number. For example, 27700 for the British National Grid system.
CRSs define how two-dimensional points (such as longitude and latitude) are actually represented in the real world. A CRS value is needed to interpret and give true meaning to coordinates. You can get and set CRSs with the command st_crs()
.
Transform CRSs with the command st_transform()
, as demonstrated in the code chunk below, which converts the ‘lon/lat’ geographic CRS of crashes_sf
into the projected CRS of the British National Grid:
Exercises:
- Try to subset the zones with the
crashes_osgb
. What does the error message say? - Create
zones_osgb
by transforming thezones
object. - Bonus: Use
st_crs()
to find out the units measurement of the British National Grid.
For more information on CRSs see Chapter 6 of Geocompuation with R (Lovelace, Nowosad, and Muenchow 2019).
7.7 Buffers
Buffers are polygons surrounding geometries, usually with fixed distance.
For example, in road safety research a 30m buffer can be created around crash locations to identify crashes that happened in close proximity to a particular junction or road segment.
Exercises:
- Find out and read the help page of
sf
’s buffer function. - Create an object called
crashes_1km_buffer
representing the area within 1 km of thecrashes_osgb
object and plot the result using the command:plot(crashes_1km_buffer)
. As a fun bonus, try:mapview::mapview(crashes_1km_buffer)
. - Bonus: Try creating buffers on the geographic version of the
crashes_sf
object. What happens?
7.8 Attribute operations on sf objects
We can do non-spatial operations on sf
objects because they are data.frame
s.
Try the following attribute operations on the zones
data:
# load example dataset if it doesn't already exist
zones = pct::get_pct_zones("isle-of-wight")
sel = zones$all > 3000 # create a subsetting object
zones_large = zones[sel, ] # subset areas with a population over 3,000
zones_2 = zones[zones$geo_name == "Isle of Wight 002",] # subset based on 'equality' query
zones_first_and_third_column = zones[c(1, 3)]
zones_just_all = zones["all"]
Exercises:
- Practice the subsetting techniques you have learned on the
sf data.frame
objectzones
:- Create an object called
zones_small
, which contains only regions with less than 3000 people in theall
column. - Create a selection object called
sel_high_car
which isTRUE
for regions with above median numbers of people who travel by car andFALSE
otherwise. - Create an object called
zones_foot
which contains only the foot attribute fromzones
. - Bonus 1: plot
zones_foot
using the functionplot
to show where walking is a popular mode of travel to work. - Bonus 2: building on your answers to previous questions, use
filter()
from thedplyr
package to subset small regions where car use is high.
- Create an object called
- Bonus: What is the population density of each region (Hint: you may need to use the functions
st_area()
,as.numeric()
and use the ‘all’ column)? - Bonus: Which zone has the highest percentage of people who cycle?
7.9 Mapping road crash data
So far we have used the plot()
function to make maps.
That’s fine for basic visualisation, but for publication-quality maps we recommend using tmap
. See Chapter 8 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019) for further explanation and alternatives.
After installation, load the package as follows:
## tmap mode set to plotting
Exercises:
- Create the plots of the
zones
object usingplot()
andtm_shape() + tm_polygons()
functions (see Figure 7.4). - Create an interactive version of the
tmap
plot by settingtmap_mode("view")
and re-running the plotting commands. - Add an additional layer to the interactive map showing the location of crashes, using marker and dot symbols.
- Bonus: Change the default basemap (Hint: you may need to search in the package documentation or online for the solution).
7.10 Analysing point data
Based on the saying, “Don’t run before you can walk,” we’ve learned the vital foundations of R before tackling a real dataset.
Temporal and spatial attributes are key to road crash data, hence the emphasis on lubridate
and sf
.
Visualisation is central to understanding data and influencing policy, which is where tmap
comes in.
With these solid foundations, plus knowledge of how to ask for help (we recommend reading R’s internal help, asking colleagues, searching the internet and creating new questions or comments on online forums such as StackOverflow or GitHub and we suggest you follow this order of resources to get help), you are ready to test the methods on some real data.
Before doing so, take a read of the stats19
vignette, which can be launched as follows:
vignette(package = "stats19") # view all vignettes available on stats19
vignette("stats19") # view the introductory vignette
This should now be sufficient to tackle the following exercises:
- Download and plot all crashes reported in Great Britain in 2018. (Hint: see the stats19 vignette)
- Find the function in the
stats19
package that converts adata.frame
object into ansf
data frame. Use this function to convert the road crashes into ansf
object, calledcrashes_sf
, for example. - Filter crashes that happened in the Isle of Wight based on attribute data. (Hint: the relevant column contains the word
local
) - Filter crashes happened in the Isle of Wight using geographic subsetting. (Hint: remember
st_crs()
?) - Bonus: Which type of subsetting yielded more results and why?
- Bonus: How many crashes happened in each zone?
- Create a new column called
month
in the crash data using the functionlubridate::month()
and thedate
column. - Create an object called
a_zones_may
representing all the crashes that happened in the Isle of Wight in the month of May. - Bonus: Calculate the average (
mean
) speed limit associated with each crash that happened in May across the zones of the Isle of Wight (the result is shown in the map).
Other mapping functions that can be powerful presentation tools are also explained in Geocomputation With R (Lovelace, Nowosad, and Muenchow 2019). These include leaflet maps, which have interactive properties and can be shared as .html files, and shiny apps that give even more interactivity as well as the possibility to embed on websites.
7.11 Analysing crash data on road networks
Road network data can be accessed from a range of sources, including OpenStreetMap (OSM) and Ordnance Survey. We will use some OSM data from the Isle of Wight, which can be loaded as follows:
u = "https://github.com/ropensci/stats19/releases/download/1.1.0/roads_key.Rds"
roads_wgs = readRDS(url(u))
roads = roads_wgs %>% st_transform(crs = 27700)
You should already have road crashes for the Isle of Wight from the previous stage. If not, load crash data as follows:
u = "https://github.com/ropensci/stats19/releases/download/1.1.0/car_accidents_2017_iow.Rds"
crashes_iow = readRDS(url(u))
- Plot the roads with the crashes overlaid.
- Create a buffer around the roads with a distance of 200m.
- How many crashes fall outside the buffered roads?
- Bonus: Use the
aggregate()
function to identify how many crashes happened per segment and plot the result withtmap
and plot the crashes that happened outside the road buffers on top, as shown in Figure 7.5 (hint: see?aggregate.sf
and take a read of Section 4.2.5 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019)). Try undertaking the same steps for a different region of your choice. An example for Essex is shown in the code chunk, which results in Figure 7.6, (thanks to Will Cubbin).17
remotes::install_github("itsleeds/osmextract") # install github package for osm data
library(osmextract)
region_name = "essex" # "essex" can be changed to another area name as required
osm_data = oe_get(region_name, extra_tags = c("maxspeed", "ref"))
table(osm_data$highway)
#filter osm_data to show only major roads
roads = osm_data %>%
filter(str_detect(highway, pattern = "moto|prim|seco|tert|trunk"))
# transform geometry and save
roads = st_transform(roads, 27700) #converts to projected BNG system for later use
# plot(roads$geometry) # basic plot of roads
tm_shape(roads) + tm_lines("maxspeed", showNA = T, lwd = 2)
saveRDS(roads, file = "roads.Rds") # Saves road dataset for future use
Bonus exercises
Identify a region and zonal units of interest from http://geoportal.statistics.gov.uk/ or from the object police_boundaries
in the stats19
package.
- Read them into R as an
sf
object. - Create a map showing the number of crashes in each zone.
- Identify the average speed limit associated with crashes in each zone.
- Identify an interesting question you can ask to the data and use exploratory data analysis to find answers.
- Check another related project for further information on smoothing techniques of counts on a linear network.
The code chunk downloads and filters road data to show only main roads, and saves as an .Rds file for later use in your R project. The geometry of the
roads.Rds
file from this example is a one dimensional line representing the centre of the road. Some functions require the road to be treated as 2D polygon with a width property. You can use thest_buffer()
function to turn this vector into a polygon.↩︎