Introduction to transport data science


Module: Transport Data Science

Robin Lovelace

2025-02-12

Who: Transport Data Science team

Robin Lovelace

  • Associate Professor of Transport Data Science
  • Researching transport futures and active travel planning
  • R developer and teacher, author of Geocomputation with R

Yuanxuan Yang

  • Lecturer in Data Science of Transport
  • New and Emerging Forms of Data: Investigating novel data sources and their applications in urban mobility and transport planning.

TDS Team II

Malcolm Morgan

  • Senior researcher at ITS with expertise in routing + web
  • Developer of the Propensity to Cycle Tool and PBCC

Zhao Wang

  • Civil Engineer and Data Scientist with expertise in machine learning

Demonstrators

You!

What is transport data science?

  • The application of data science to transport datasets and problems
  • Raising the question…
  • What is data science?
  • A discipline “that allows you to turn raw data into understanding, insight, and knowledge” (Grolemund, 2016)

In other words…

  • Statistics that is actually useful!

Why take Transport Data Science?

  • New skills (cutting edge R and/or Python packages)
  • Potential for real-world impact
  • Allows you to do new things with data
  • It might get you a job!

Example

Data science spin-out company: ImpactML

Data science employability

“The Bureau of Labor Statistics in the US projects a 35% increase in data science roles in the decade 2022-2032.” Source: visualisecurious.com

Live demo: npt.scot web app

The history of TDS

  • 2017: Transport Data Science created, led by Dr Charles Fox, Computer Scientist, author of Transport Data Science book (Fox, 2018)

  • The focus was on databases and Bayesian methods

  • 2019: I inherited the module, which was attended by ITS students

  • Summer 2019: Python code published in the module ‘repo’:

History of TDS II

  • January 2020: Module made available on the Data Science MSc course
  • March 2020: Switch to online teaching
  • 2021-2023: Updated module, focus on methods
  • 2024: Switch to combined practical sessions and lectures
  • 2025+: Expand, online course? book? stay in touch!

Reading list

See the reading list for details

Objectives

  • Understand the structure of transport datasets

  • Understand how to obtain, clean and store transport related data

  • Gain proficiency in command-line tools for handling large transport datasets

  • Produce data visualisations, both static and interactive

  • Learn how to join together the components of transport data science into a cohesive project portfolio

Assessment (for those taking this as a credit-bearing module)

  • You will build up a portfolio of work
  • 100% coursework assessed, you will submit by
  • Written in code - will be graded for reproducibility
  • Code chunks and figures are encouraged
  • You will submit a non-assessed 2-page PDF + qmd

Feedback

The module is taught by two really well organised and enthusiastic professors, great module, the seminars, structured and unstructured learning was great and well thought out, all came together well

I wish this module was 60 credits instead of 15 because i just want more of it.

Timetable

See the schedule for details

What is science?

  • Scientific knowledge consists of hypotheses that can be falsified
  • Science is the process of generating falsifiable hypotheses and testing them
  • In a reproducible way
  • Systematically

  • Falsifiability is central to the scientific process (Popper 1959)
  • All of which requires software conducive to reproducibility

Transport planning software

Transport modelling software products are a vital component of modern transport planning and research.

  • They generate the evidence base on which strategic investments are made and, furthermore,
  • provide a powerful mechanism for researching alternative futures.

It would not be an overstatement to say that software determines the range of futures that are visible to policymakers. This makes the status of transport modelling software, and how it may evolve in the future, important questions.

What will transport software look like? What will its capabilities be? And who will control it? Answers to each of these questions will affect the future of transport systems.

  • Premise: transport planning/modelling software used in practice is becoming increasingly data-driven, modular and open.

Current transport software

The four-stage model still dominates transport planning models (Boyce and Williams 2015)

The four-stage model

  • Impacts the current software landscape

  • Dominated by a few proprietary products

  • Limited support community online

  • High degree of lock-in

  • Limited cross-department collaboration

Existing products

Sample of transport modelling software in use by practitioners.

Software   Company/Developer    Company HQ   Licence             Citations
Visum      PTV                  Germany      Proprietary         1810
MATSim     TU Berlin            Germany      Open source (GPL)   1470
TransCAD   Caliper              USA          Proprietary         1360
SUMO       DLR                  Germany      Open source (EPL)   1310
Emme       INRO                 Canada       Proprietary         780
Cube       Citilabs             USA          Proprietary         400
sDNA       Cardiff University   UK           Open source (GPL)   170

User support

Getting help is vital for learning/improving software

“10-Hour Service Pack $2,000” (source: caliper.com/tcprice.htm)

Online communities

  • gis.stackexchange.com has 21,314 questions
  • r-sig-geo has 1000s of posts
  • RStudio’s Discourse community has 65,000+ posts already!
  • No clear transport equivalent (e.g. earthscience.stackexchange.com is in beta)
  • Solution: build our own community!
    • See https://github.com/ITSLeeds/TDS/issues for example
    • Place for discussions: https://github.com/itsleeds/tds/discussions

Best way to get support is peer-to-peer:

Source: https://community.rstudio.com/about

How is data science used in the PCT?

  • It’s all reproducible, e.g.:
  • Find commuting desire lines in West Yorkshire between 1 and 3 km long in which more people drive than cycle:
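The query described above can be sketched in a few lines of code. This is a minimal illustration, not the code behind the PCT: the `DesireLine` class, its field names and the zone records are all hypothetical, made up so the example is self-contained. In practice you would load real origin-destination data and probably use a data frame library rather than plain Python lists.

```python
from dataclasses import dataclass


@dataclass
class DesireLine:
    """One origin-destination pair with commuter counts by mode."""
    origin: str
    destination: str
    length_km: float  # straight-line distance between zone centroids
    car_driver: int   # commuters who drive
    bicycle: int      # commuters who cycle


# Hypothetical illustrative records, NOT real PCT or census data
desire_lines = [
    DesireLine("Zone A", "Zone B", 0.8, 40, 12),
    DesireLine("Zone A", "Zone C", 2.4, 95, 10),
    DesireLine("Zone B", "Zone C", 1.6, 20, 35),
    DesireLine("Zone C", "Zone D", 2.9, 60, 5),
    DesireLine("Zone A", "Zone D", 5.2, 150, 3),
]


def short_car_dominated(lines, min_km=1.0, max_km=3.0):
    """Keep lines inside the distance band where drivers outnumber cyclists."""
    return [
        line for line in lines
        if min_km <= line.length_km <= max_km
        and line.car_driver > line.bicycle
    ]


selected = short_car_dominated(desire_lines)
for line in selected:
    print(f"{line.origin} -> {line.destination}: {line.length_km} km, "
          f"{line.car_driver} drive, {line.bicycle} cycle")
```

Because the filter is written as a function with explicit thresholds, anyone can rerun it with a different distance band and get the same answer from the same input, which is the sense of "reproducible" used here.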

Visualising data

A fundamental part of data science is being able to understand your data.

That requires visualisation, and R is great for that:

Interactively

Processing data with code

  • Now that we have the data on our computer, and have verified that it works, we can use it

  • Which places are most car dependent?
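One way to approach a question like this is to sum commuter counts by origin zone and rank zones by the share who drive. The sketch below assumes a simple, hypothetical flow table of (origin zone, commuters driving, all commuters); the zone names and numbers are invented for illustration and are not module data.

```python
from collections import defaultdict

# Hypothetical flows: (origin zone, commuters driving, all commuters)
flows = [
    ("Zone A", 40, 80),
    ("Zone A", 95, 120),
    ("Zone B", 20, 100),
    ("Zone C", 60, 75),
    ("Zone C", 30, 75),
]


def car_dependency_by_zone(flows):
    """Return {zone: share of commuters who drive}, summed over all flows from the zone."""
    drivers = defaultdict(int)
    total = defaultdict(int)
    for zone, n_drive, n_all in flows:
        drivers[zone] += n_drive
        total[zone] += n_all
    # Skip zones with no recorded commuters to avoid dividing by zero
    return {zone: drivers[zone] / total[zone] for zone in total if total[zone] > 0}


shares = car_dependency_by_zone(flows)
# Rank zones from most to least car dependent
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for zone, share in ranked:
    print(f"{zone}: {share:.0%} of commuters drive")
```

Summing counts before dividing, rather than averaging per-flow percentages, stops small flows from distorting a zone's overall share.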

Checking the results:

R vs Python

  • Lots of debate on this topic - see https://blog.usejournal.com/python-vs-and-r-for-data-science-833b48ccc91d

How to decide?

  • If your priority is getting things done quickly (with support from me ;) go with R
  • If you already know Python and are 100% confident you can generate reproducible results, go with that
  • If you want to be avant-garde and try something else like Julia, do it (as long as it’s reproducible)

Gamification

  • Completely open source, written in Rust
  • Source: video at https://github.com/dabreegster/abstreet/#ab-street

Summary

  • Walk and understand the data before doing complex things
  • Visualise the data, ask questions of it, descriptive stats
  • Only then add complexity to your analysis
  • Starting point for this: Transport chapter of Geocomputation with R (Lovelace, Nowosad, and Münchow 2025)

Practical session

References

Lovelace, Robin, Jakub Nowosad, and Jannes Münchow. 2025. Geocomputation with R. CRC Press. https://r.geocompx.org/.
Popper, Karl. 1959. The Logic of Scientific Discovery. Hutchinson. http://books.google.com/books?id=MdvaSAAACAAJ&pgis=1.