Introduction to transport data science

Module: Transport Data Science

Robin Lovelace

2025-06-10

Who: Transport Data Science team

Robin Lovelace

Associate Professor of Transport Data Science
Researching transport futures and active travel planning
R developer and teacher, author of Geocomputation with R

Yuanxuan Yang

Lecturer in Data Science of Transport
New and Emerging Forms of Data: Investigating novel data sources and their applications in urban mobility and transport planning.

TDS Team II

Malcolm Morgan

Senior researcher at ITS with expertise in routing + web
Developer of the Propensity to Cycle Tool and PBCC

Zhao Wang

Civil Engineer and Data Scientist with expertise in machine learning

Demonstrators

You!

What is transport data science?

The application of data science to transport datasets and problems
Raising the question…
What is data science?
A discipline “that allows you to turn raw data into understanding, insight, and knowledge” (Grolemund, 2016)

In other words…

Statistics that is actually useful!

Why take Transport Data Science?

New skills (cutting edge R and/or Python packages)
Potential for impacts
Allows you to do new things with data
It might get you a job!

Example

Data science spin-out company: ImpactML

Data science employability

“The Bureau of Labor Statistics in the US projects a 35% increase in data science roles in the decade 2022-2032.” Source: visualisecurious.com

Live demo: npt.scot web app

The history of TDS

2017: Transport Data Science created, led by Dr Charles Fox, Computer Scientist and author of the book Data Science for Transport (Fox, 2018)
The focus was on databases and Bayesian methods
2019: I inherited the module, which was attended by ITS students
Summer 2019: Python code published in the module ‘repo’:
- github.com/ITSLeeds

History of TDS II

January 2020: Made available to students on the Data Science MSc course
March 2020: Switch to online teaching
2021-2023: Updated module, focus on methods
2024: Switch to combined practical sessions and lectures
2025+: Expand, online course? book? stay in touch!

Milestone passed in my academic career, first online-only delivery of lecture (ITSLeeds?), seems to have worked, live code demo with #rstats/rstudio, recording, chat + all🎉

Thanks students for ‘attending’ + remote participation, we’ll get through this together. #coronavirus pic.twitter.com/wlAUxmZj5r

— Robin Lovelace March 17, 2020

Reading list

See the reading list for details

Objectives

Understand the structure of transport datasets
Understand how to obtain, clean and store transport related data
Gain proficiency in command-line tools for handling large transport datasets
Produce data visualizations, static and interactive
Learn how to join together the components of transport data science into a cohesive project portfolio

Assessment (for those doing this as credit-bearing)

You will build up a portfolio of work
100% coursework assessed, you will submit by
Written in code - will be graded for reproducibility
Code chunks and figures are encouraged
You will submit a non-assessed 2 page pdf + qmd

Feedback

The module is taught by two really well organised and enthusiastic professors, great module, the seminars, structured and unstructured learning was great and well thought out, all came together well

I wish this module was 60 credits instead of 15 because i just want more of it.

Timetable

See the schedule for details

What is science?

Scientific knowledge is hypotheses that can be falsified
Science is the process of generating falsifiable hypotheses and testing them
In a reproducible way
Systematically

Falsifiability is central to the scientific process (Popper 1959)
All of which requires software conducive to reproducibility

Transport planning software

Transport modelling software products are a vital component of modern transport planning and research.

They generate the evidence base on which strategic investments are made and, furthermore,
provide a powerful mechanism for researching alternative futures.

It would not be an overstatement to say that software determines the range of futures that are visible to policymakers. This makes the status of transport modelling software, and how it may evolve in the future, important questions.

What will transport software look like? What will its capabilities be? And who will control it? Answers to each of these questions will affect the future of transport systems.

Premise: transport planning/modelling software used in practice ~~will become~~ is becoming increasingly data-driven, modular and open.

Current transport software

4-stage model still dominates transport planning models (Boyce and Williams 2015)

The four stage model

Impacts the current software landscape
Dominated by a few proprietary products
Limited support community online
High degree of lock-in
Limited cross-department collaboration

Existing products

Sample of transport modelling software in use by practitioners.

Software	Company/Developer	Company HQ	Licence	Citations
Visum	PTV	Germany	Proprietary	1810
MATSim	TU Berlin	Germany	Open source (GPL)	1470
TransCAD	Caliper	USA	Proprietary	1360
SUMO	DLR	Germany	Open source (EPL)	1310
Emme	INRO	Canada	Proprietary	780
Cube	Citilabs	USA	Proprietary	400
sDNA	Cardiff University	UK	Open source (GPL)	170

User support

Getting help is vital for learning/improving software

“10-Hour Service Pack $2,000” (source: caliper.com/tcprice.htm)

Online communities

gis.stackexchange.com has 21,314 questions
r-sig-geo has 1000s of posts
RStudio’s Discourse community has 65,000+ posts already!
No clear transport equivalent (e.g. earthscience.stackexchange.com is in beta)
Solution: build our own community!
- See https://github.com/ITSLeeds/TDS/issues for example
- Place for discussions: https://github.com/itsleeds/tds/discussions

Best way to get support is peer-to-peer:

Source: https://community.rstudio.com/about

How is data science used in the Propensity to Cycle Tool (PCT)?

It’s all reproducible, e.g.:
Find commuting desire lines in West Yorkshire between 1 and 3 km long in which more people drive than cycle:
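A minimal sketch of this kind of query, written in plain Python with no external packages. The records, field names (`distance_km`, `car_driver`, `bicycle`) and the helper `short_car_dominated` are all hypothetical and invented for illustration; they are not the PCT's actual data or code.

```python
# Hypothetical origin-destination records: one dict per desire line.
# All field names and values are invented for illustration.
desire_lines = [
    {"origin": "A", "destination": "B", "distance_km": 0.8, "car_driver": 40, "bicycle": 5},
    {"origin": "A", "destination": "C", "distance_km": 1.6, "car_driver": 120, "bicycle": 30},
    {"origin": "B", "destination": "C", "distance_km": 2.4, "car_driver": 15, "bicycle": 25},
    {"origin": "C", "destination": "D", "distance_km": 2.9, "car_driver": 60, "bicycle": 10},
    {"origin": "D", "destination": "A", "distance_km": 4.2, "car_driver": 200, "bicycle": 2},
]


def short_car_dominated(lines, min_km=1.0, max_km=3.0):
    """Keep lines inside the distance band where drivers outnumber cyclists."""
    return [
        line
        for line in lines
        if min_km <= line["distance_km"] <= max_km
        and line["car_driver"] > line["bicycle"]
    ]


selected = short_car_dominated(desire_lines)
for line in selected:
    print(line["origin"], line["destination"], line["distance_km"])
```

Because the filter is a function of the input data only, rerunning the script on the same records always returns the same desire lines, which is the sense of "reproducible" used here.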

Visualising data

A fundamental part of data science is being able to understand your data.

That requires visualisation; R is great for that:

Interactively

Processing data with code

Now that we have data on our computer, and have verified it works, we can use it
Which places are most car dependent?
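One way to approach this question is to aggregate flows by origin zone and rank zones by the share of commuters who drive. The sketch below uses plain Python; the zone names, counts, field names (`all`, `car_driver`) and the function `car_dependency_by_zone` are hypothetical, and "car dependent" is assumed here to mean a high proportion of car drivers among all commuters.

```python
from collections import defaultdict

# Hypothetical commuter counts per desire line (invented values).
od_flows = [
    {"origin": "A", "all": 200, "car_driver": 150},
    {"origin": "A", "all": 100, "car_driver": 30},
    {"origin": "B", "all": 80, "car_driver": 72},
    {"origin": "C", "all": 120, "car_driver": 36},
]


def car_dependency_by_zone(flows):
    """Return (zone, car share) pairs, most car dependent zone first."""
    totals = defaultdict(lambda: {"all": 0, "car_driver": 0})
    for flow in flows:
        zone = totals[flow["origin"]]
        zone["all"] += flow["all"]
        zone["car_driver"] += flow["car_driver"]
    # Skip zones with no commuters to avoid dividing by zero.
    shares = {
        name: t["car_driver"] / t["all"]
        for name, t in totals.items()
        if t["all"] > 0
    }
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


ranking = car_dependency_by_zone(od_flows)
for zone_name, share in ranking:
    print(f"{zone_name}: {share:.0%} of commuters drive")
```

Summing counts before dividing, rather than averaging per-line shares, stops small flows from having the same weight as large ones.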

Checking the results:

R vs Python

Lots of debate on this topic - see https://blog.usejournal.com/python-vs-and-r-for-data-science-833b48ccc91d

How to decide?

If your priority is getting things done quickly (with support from me ;) go with R
If you already know Python and are 100% confident you can generate reproducible results, go with that
If you want to be avant-garde and try something else like Julia, do it (as long as it’s reproducible)

Gamification

Completely open source, written in Rust
Source: video at https://github.com/dabreegster/abstreet/#ab-street

Summary

Walk and understand the data before doing complex things
Visualise the data, ask questions of it, descriptive stats
Only then add complexity to your analysis
Starting point for this: Transport chapter of Geocomputation with R (Lovelace, Nowosad, and Münchow 2025)

Practical session

References

Lovelace, Robin, Jakub Nowosad, and Jannes Münchow. 2025. Geocomputation with R. CRC Press. https://r.geocompx.org/.

Popper, Karl. 1959. The Logic of Scientific Discovery. Hutchinson. http://books.google.com/books?id=MdvaSAAACAAJ&pgis=1.