Session 3: AI for Transport Data Analysis

1 Introduction

In this session, we will explore how AI can be used to augment and accelerate data analysis tasks in transport planning. We will focus on using AI to write code for analyzing and visualizing transport data.

2 Finding and Cleaning Data with AI

Generally available AI models, such as the GPT series (currently used in M365 Copilot and among the models you can use in GitHub Copilot), are generalists. That means they are not especially strong in very specific domains, such as knowing where transport datasets are located.

You can check which model you are using by asking the AI directly.

Example conversation

User: What model are you?

Copilot: I’m based on GPT‑5, a large language model developed by OpenAI and integrated into Microsoft Copilot. This means I can handle a wide range of tasks—answering questions, generating content, analyzing data, and more—while adapting to your context.

To provide an example, imagine you need a dataset of the motorway network in the South East of England. You could try this prompt in an AI chat tool, as shown in the details below.

However, a more effective approach is usually to ask the AI to help you write code that searches for and downloads relevant datasets from known repositories, such as data.gov.uk or OpenStreetMap.

2.1 Example: Downloading motorway data

Note: This example demonstrates where AI can be helpful but also where human intervention is crucial. AI-generated code often requires careful checking against documentation and may use outdated packages or file formats.

  1. Write a prompt that results in code to download motorway data in the South East of England.

  2. Update the prompt to make it more specific, e.g. telling it to use a specific language or package.

  3. Try running the code.

Example prompt and output

Prompt:

“Write R code to download motorway network data for the South East of England from OpenStreetMap using the osmdata package.”

AI Output:

library(osmdata)
library(sf)
# Define the bounding box for South East England
bbox <- getbb("South East England, UK")
# Create an Overpass query for motorways
motorway_query <- opq(bbox) %>%
  add_osm_feature(key = "highway", value = "motorway")
# Download the data
motorway_data <- osmdata_sf(motorway_query)
# Extract the motorway lines
motorways <- motorway_data$osm_lines
# Plot the result
plot(st_geometry(motorways), col = 'blue', main = "Motorway Network in South East England")
# Save to a shapefile
st_write(motorways, "south_east_motorways.shp")

Notes:

  1. It worked in one shot!

A year ago this would not have worked first time, but the models are improving rapidly.

  2. However, it was not perfect:
  • It saved the file in an out-of-date file format (shapefile) rather than a more modern one (GeoPackage).
  • It used the outdated %>% pipe operator from magrittr rather than the native R pipe |>.
  3. Refining the prompt can easily fix these issues.
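For example, asking the AI to use the native pipe and save to GeoPackage might yield code like the following (a sketch, not a definitive answer; it still requires network access to the Overpass API, and the |> pipe needs R 4.1 or later):

```r
library(osmdata)
library(sf)

# Bounding box for South East England
bbox <- getbb("South East England, UK")

# Query motorways using the native R pipe
motorways <- opq(bbox) |>
  add_osm_feature(key = "highway", value = "motorway") |>
  osmdata_sf()

# Save the line geometries to a GeoPackage instead of a shapefile
st_write(motorways$osm_lines, "south_east_motorways.gpkg")
```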

2.2 Example: Downloading road traffic collision data in TfSE

See the bonus section at the end of this page for an example of how to use R and AI to download and filter road traffic collision data for the Transport for the South East (TfSE) region.

2.3 Bonus activity

Use AI to help you find and import data on a topic of your choice relevant to transport planning.

3 Exploratory Data Analysis (EDA) with AI

Take a read of the AI-assisted EDA session in survey-prompts.md for ideas on how to use AI to speed up EDA tasks. See the resulting code and outputs in the survey page.

3.1 Exercises

Option 1: Try to reproduce some of the outputs, and new outputs, that were generated in the survey analysis session.

Option 2: Try to download a specific transport dataset (e.g. the motorway network) using AI assistance, verifying the code and output carefully.

Option 3: Use AI to help you explore the motorway data you downloaded earlier in this session.

Option 4: Use AI to help you explore a transport dataset of your choice.

4 AI-Assisted Data Visualization

Data visualization is a key part of data analysis, but the syntax for creating visualizations can be complex and time-consuming to learn. AI can help by generating code snippets for visualizations based on natural language prompts.

4.1 Exercises

Continue with the EDA tasks from the previous exercise, but add visualizations.

You can use packages of your choice, but we recommend you try the following:

  • For people using R: ggplot2 (part of the tidyverse), plotly (for interactive plots), or mapview (for spatial data).

  • For people using Python: matplotlib, seaborn, or plotly.
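As an illustration, a prompt such as "Plot the number of collisions per month as a bar chart using ggplot2" might produce code along these lines (a sketch using made-up illustrative values, since your own EDA outputs will differ):

```r
library(ggplot2)

# Hypothetical summary data: collisions per month (illustrative values only)
monthly <- data.frame(
  month = factor(month.abb, levels = month.abb),
  collisions = c(210, 180, 195, 170, 220, 240, 260, 255, 230, 245, 215, 200)
)

# Build a bar chart of collisions per month
p <- ggplot(monthly, aes(x = month, y = collisions)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Collisions per month (illustrative data)",
    x = "Month",
    y = "Number of collisions"
  )

p # In an interactive session this displays the chart
```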

5 Interpreting Results with LLMs

Copy-paste some of the results from your EDA and visualizations into an AI chat tool, and ask it to help you interpret the results.

Do you agree with the AI’s interpretation?

Advanced: write data analysis code in a Quarto (.qmd) file and render it to Markdown with the following command:

quarto render your_file.qmd --to markdown

Then ask the AI to read the output (in this case called your_file.md) and help you interpret the results.
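A minimal .qmd file for this workflow might look like the following (a sketch; the title, file name, and object names are illustrative):

````
---
title: "Motorway EDA"
format: markdown
---

```{r}
library(sf)
motorways <- read_sf("south_east_motorways.gpkg")
summary(st_length(motorways))
```
````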

6 Bonus: Analysing road traffic collision data in TfSE

In this section, we demonstrate how to programmatically access and filter large national datasets using R.

library(stats19)
library(sf)
library(dplyr)

# 1. Load the study area boundary
# We use the pre-prepared TfSE boundary file
tfse_boundary = read_sf("tfse_boundary.geojson")

We can download road safety data directly from the Department for Transport using the stats19 package.

Example: Getting 5 years of data (Code only)

# This code downloads collision data for 2020-2024
years = 2020:2024
collisions_5y = purrr::map_dfr(years, ~ stats19::get_stats19(year = .x, type = "collision"))

Example: Joining Casualties (Code only)

# This code gets casualty data and joins it to the collisions
casualties_5y = purrr::map_dfr(years, ~ stats19::get_stats19(year = .x, type = "casualty"))

casualties_joined = dplyr::left_join(
  casualties_5y,
  collisions_5y,
  by = "collision_index" # Common ID field
)
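Once joined, the combined table can be summarised with dplyr. For example, counting casualties by severity (a sketch using a small stand-in data frame, since the real table comes from the stats19 downloads above):

```r
library(dplyr)

# Illustrative stand-in for the joined table (real data come from stats19)
casualties_joined <- data.frame(
  collision_index   = c("A1", "A1", "B2", "C3"),
  casualty_severity = c("Slight", "Serious", "Slight", "Fatal")
)

# Count casualties by severity, most common first
severity_counts <- casualties_joined |>
  count(casualty_severity, sort = TRUE)

severity_counts
```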

Live Demo: Getting 2020 Collisions for TfSE

For this exercise, we will download just the 2020 collision data and filter it to our study area.

# 1. Get 2020 collision data
collisions_2020 = get_stats19(year = 2020, type = "collision")

# 2. Convert to an 'sf' object (spatial data)
collisions_sf = format_sf(collisions_2020, lonlat = TRUE)

# 3. Filter to the TfSE boundary
# Ensure Coordinate Reference Systems match
collisions_sf = st_transform(collisions_sf, st_crs(tfse_boundary))

# Spatial subset: keep collisions within the boundary
collisions_tfse = collisions_sf[tfse_boundary, ]

print(paste("Original 2020 collisions:", nrow(collisions_sf)))
print(paste("TfSE collisions:", nrow(collisions_tfse)))

# 4. Quick visual check
plot(st_geometry(collisions_tfse), pch = 19, cex = 0.1, col = "red", 
     main = "Road Collisions in TfSE (2020)")
plot(st_geometry(tfse_boundary), add = TRUE, border = "blue")
