Thus far, we have been working primarily with ‘accident’ level data, but there is much useful data in other tables. As outlined in the stats19 vignette — which you can view by entering the command vignette("stats19") to get extended help pages about R packages — there are three main tables that contain STATS19 data.
Let’s read-in data from 2023 to take a look:
library(stats19)library(dplyr)library(ggplot2)ac =get_stats19(year =2023, type ="acciden")ca =get_stats19(year =2023, type ="casualt")
Warning in asMethod(object): NAs introduced by coercion
ve =get_stats19(year =2023, type ="vehicle")
Warning: The following named parsers don't match the column names: accident_severity, carriageway_hazards, collision_index, collision_reference, collision_year, date, day_of_week, did_police_officer_attend_scene_of_accident, did_police_officer_attend_scene_of_collision, enhanced_collision_severity, first_road_class, first_road_number, junction_control, junction_detail, latitude, legacy_collision_severity, light_conditions, local_authority_district, local_authority_highway, local_authority_ons_district, location_easting_osgr, location_northing_osgr, longitude, lsoa_of_accident_location, lsoa_of_collision_location, number_of_casualties, number_of_vehicles, pedestrian_crossing_human_control, pedestrian_crossing_physical_facilities, police_force, road_surface_conditions, road_type, second_road_class, second_road_number, special_conditions_at_site, speed_limit, time, trunk_road_flag, urban_or_rural_area, weather_conditions, age_band_of_casualty, age_of_casualty, bus_or_coach_passenger, car_passenger, casualty_class, casualty_distance_banding, casualty_home_area_type, casualty_imd_decile, casualty_reference, casualty_severity, casualty_type, enhanced_casualty_severity, lsoa_of_casualty, pedestrian_location, pedestrian_movement, pedestrian_road_maintenance_worker, sex_of_casualty, adjusted_serious, adjusted_slight, injury_based, accident_ref_no, effective_date_of_change, previously_published_value, replacement_value, variable
NAs introduced by coercion
NAs introduced by coercion
The three objects read-in above correspond to the main types of entity that are recorded by the police:
Crashes: The ‘crash event’ table contains general data about crashes, including where and when they happened and the conditions in which the crash occurred (e.g. light levels in the column light_conditions in the ac object). For historical reasons, crash level data is stored in tables called ‘Accidents’ (a term that has fallen out of favour because it implies that nobody was at fault). See names for all 33 variables in the crashes table by running the command names(ac). Crashes range from collisions involving only one vehicle and another entity (e.g. a person on foot, bicycle or a car) causing only ‘slight’ injuries such as a graze, to multi-vehicle pile-ups involving multiple deaths and dozens of slight and serious injuries.
Casualties: The casualties table, assigned to an object called ca in the code above, contains data at the casualty level. As you will see by running the command names(ca), the STATS19 casualties table has 16 variables including age_of_casualty, casualty_severity and casualty_type, reporting the mode of transport in which the person was travelling when they were hit.
Vehicles: The vehicles table, assigned to ve above, contains information about the vehicles and their drivers involved in each collision. As you will see by running the command names(ve), the 23 variables in this table includes vehicle_type, hit_object_off_carriageway and first_point_of_impact. Information about the driver of vehicles involved is contained in variables such as age_of_driver, engine_capacity_cc and age_of_vehicle.
Each table represents the same phenomena: road casualties in Great Britain in 2023. Therefore, you may expect they would have the same number of rows, but this is not the case:
nrow(ac)
[1] 104258
nrow(ca)
[1] 132977
nrow(ve)
[1] 189815
The reason for this is that there are, on average, more than one casualty per crash (e.g. when a car hits two people), and more than one vehicle, including bicycles, per crash. We can find the average number of casualties and vehicles per crash as follows:
nrow(ca) /nrow(ac)
[1] 1.275461
nrow(ve) /nrow(ac)
[1] 1.820628
The output of the commands above show that there are around 1.3 casualties and 1.8 vehicles involved in each crash record in the STATS19 dataset for 2023. Each table contains a different number of columns, reporting the characteristics of each casualty and each driver/vehicle for the ca and ve datasets respectively.
ncol(ac)
[1] 38
ncol(ca)
[1] 21
ncol(ve)
[1] 34
The output of the previous code chunk shows that we have more variables in the ‘accidents’ table than the others but the others, but the other tables are data rich with 16 columns on the casualties and 23 on the vehicles. To check that the datasets are consistent, we can check that the number of casualties reported in the crashes table is equal to the number of rows in the casualties table, and the same for the vehicles table:
To join casualty (or vehicle) data onto the ac object above, the inner_join() function from dplyr can be used as follows:
ac_cas_joined =inner_join(ac, ca)
Joining with `by = join_by(accident_index, accident_year, accident_reference)`
The above command worked because the two datasets have a shared variable name: accident_index. Note that the command worked by duplicating accident records for multiple casualties. We can see this finding the accident that had the most crashes and printing the results in the ac and new joined dataset, as follows:
id_with_most_crashes = ac %>%top_n(n =1, wt = number_of_casualties) %>%pull(accident_index)id_with_most_crashes
[1] "2023520300610"
ac %>%filter(accident_index == id_with_most_crashes) %>%select(accident_index, accident_severity, number_of_vehicles, number_of_casualties)
The same approach can be used to join vehicle data onto the crash record data:
ac_veh_joined =inner_join(ac, ve)
Joining with `by = join_by(accident_index, accident_year, accident_reference)`
This information can be used as the basis of who-hit-who visualisation, in this case looking at vehicles involved in the most common type of casualties (see the trafficalmr package for functions to recode stats19 data):
casualty_types_df = ac_cas_joined |>count(casualty_type) |>arrange(desc(n))# top_casualty_types# 1 Car occupant 71145# 2 Pedestrian 19263# 3 Cyclist 14999# 4 Motorcycle 125cc and under rider or passenger 9115# 5 Motorcycle over 500cc rider or passenger 4385# 6 Van / Goods vehicle (3.5 tonnes mgw or under) occupant 3815# 7 Bus or coach occupant (17 or more pass seats) 2399# 8 Motorcycle over 125cc and up to 500cc rider or passenger 1924# 9 Other vehicle occupant 1531# 10 Taxi/Private hire car occupant 1530recode_casualties =function(x) { x_updated = dplyr::case_when( stringr::str_detect(x, "Moto") ~"Motorcyclist", stringr::str_detect(x, "Car|Pedestrian|Cyclist") ~ x,TRUE~"Other" )}ac_cas_joined$cas_type =recode_casualties(ac_cas_joined$casualty_type)barplot(table(ac_cas_joined$cas_type))
Barplot of recoded casualty type frequencies
To find the largest vehicle involved in each casualty, we can similarly pre-process the vehicle data as follows:
vehicle_types_df = ac_veh_joined |>count(vehicle_type) |>arrange(desc(n))# 1 Car 130164# 2 Pedal cycle 15667# 3 Van / Goods 3.5 tonnes mgw or under 11727# 4 Motorcycle 125cc and under 9710# 5 Motorcycle over 500cc 4473# 6 Bus or coach (17 or more pass seats) 3171# 7 Taxi/Private hire car 3073# 8 Goods 7.5 tonnes mgw and over 2647# 9 Other vehicle 2200# 10 Motorcycle over 125cc and up to 500cc 1975recode_vehicles =function(x) { x_updated = dplyr::case_when( stringr::str_detect(x, "Pedal") ~"Bicycle", stringr::str_detect(x, "Car") ~"Car", stringr::str_detect(x, "Van") ~"Van", stringr::str_detect(x, "HGV|over") ~"HGV",TRUE~"Other" )return(x_updated)}ac_veh_joined$veh_type =recode_vehicles(ac_veh_joined$vehicle_type)levels =c("Bicycle", "Car", "Other", "Van", "HGV")ac_veh_joined$vehicle =factor(ac_veh_joined$veh_type, levels = levels, ordered =TRUE)summary(ac_veh_joined$vehicle)
Bicycle Car Other Van HGV
15667 130164 22371 11727 9886
cas_veh_table =table(ac_cas_veh_largest$cas_type, ac_cas_veh_largest$largest_vehicle)cvt_df =as.data.frame(cas_veh_table)ggplot(cvt_df) +geom_bar(aes(Var2, Freq, fill = Var1), stat ="identity") +scale_fill_discrete("Casualty type") +xlab("Largest vehicle involved") +ylab("Number of casualties")
Figure 8.1: ‘Who hit who’ visualisation of number of casualties (y axis) hurt in crashes involving different vehicle types (largest vehicle in each crash on Y axis).
8.4 Case study: London
The three main tables we have just read-in can be joined by the accident_index variable and then filtered using other variables. This is demonstrated in the code chunk below, which subsets all casualties that took place in London, and counts the number of casualties by severity for each crash:
What just happened? We found the subset of casualties that took place in London with reference to the accident_index variable. Then we used the dplyr function, summarise(), to find the number of people who were in a car, cycling, and walking when they were injured. This new casualty dataset is joined onto the crashes_lnd dataset. The result is a spatial (sf) data frame of ac in London, with columns counting how many road users of different types were hurt. The joined data has additional variables:
base::setdiff(names(cj), names(ac_lnd))
[1] "Total" "walking" "cycling" "passenger"
As a simple spatial plot, we can map all crashes that occurred in London in 2023, with the colour related to the total number of people hurt in each crash. Placing this plot next to a map of London provides context:
The spatial distribution of crashes in London clearly relates to the region’s geography. Car crashes tend to happen on fast roads, including busy dual carriageway roads, displayed in yellow in Figure 8.2 above. Cycling is as an urban activity, and the most bike crashes can be found in or near the centre of London, which has a comparatively high level of cycling (compared to the low baseline of 3%).
In addition to the Total number of people hurt/killed, cj contains a column for each type of casualty (cyclist, car occupant, etc.), and a number corresponding to casualties in crashes involving each type of vehicle. It also contains the geometry column from ac_sf. In other words, joins allow the casualties and vehicles tables to be geo-referenced. We can then explore the spatial distribution of different casualty types. For example, Figure 8.3 shows the spatial distribution of pedestrians and car passengers hurt in car crashes across London in 2023, via the following code:
Figure 8.2: Spatial distribution of serious and fatal crashes in London, for cycling, walking, being a car passenger and other modes of travel. Colour is related to the speed limit where the crash happened (red is faster) and size is proportional to the total number of people hurt in each crash (legend not shown).
Exercises:
There is a lot going on in the code in this chapter, the most advanced of the guide. With reference to online help, work through the code line-by-line and look-up any aspects of the code that you do not fully understand to help figure out what is going on.
Reproduce the final figures for a different city of your choice (not London).
Bonus: Create more attractive interactive maps to show the spatial distribution of different casualty types in the city of your choice.