Transport Data Science

A module on using data science to solve transport problems.

1 R, Python or other?

While the focus of the module is on methods that can be implemented in many languages, we expect most people taking this course will use R for the practicals and for the end-of-course project. We use R because it is hard to beat as a "batteries included" data science environment: many mature packages for data manipulation, visualisation, and statistical analysis can be installed within seconds, without having to worry about package conflicts or managing environments. R is also the language with which the module team has the most experience.

Python is another excellent choice for transport data science, and many of the example code chunks we provide in R have been ported to Python. The example below illustrates this: it loads the R package {sf} and the equivalent Python package {geopandas}, and uses each to read the same geographic data file.

# R
library(sf)
geo_data = read_sf("geo_data.gpkg")

# Python
import geopandas as gpd
geo_data = gpd.read_file("geo_data.gpkg")

Support for Python should, however, be considered work in progress. If you do choose to use Python, you will be expected to manage your own Python environment and to be able to translate the R code examples into Python.

If you are feeling adventurous, you could try using Julia, JavaScript/TypeScript (e.g. via Observable) or another language, but you will be on your own in terms of support.

2 Prerequisites

2.1 Hardware

Access to a computer that you have permission to install software on, with at least 8 GB of RAM, is highly recommended. You could use a cloud-based service such as Posit Cloud (formerly RStudio Cloud), Google Colab, or GitHub Codespaces, but you would need to be comfortable with using these services and would miss out on some of the benefits of using your own computer.

2.2 Software

Although you are free to use any software for the course, the emphasis on reproducibility and interactive data science means that popular and established data science languages such as R and Python are recommended.

The teaching will be delivered primarily in R, with some Python code snippets and examples. Unless you have a good reason to use Python, we recommend you use R for the course.

2.2.1 R

You should have the latest stable release of R (4.3.0 or above) and RStudio (recommended) or another IDE such as VS Code (if you have prior experience with it) installed on your computer.

  • R from cran.r-project.org
  • RStudio from rstudio.com (recommended) or VS Code with the R extension installed.
  • R packages, which can be installed by opening RStudio and typing install.packages("stats19") in the R console, for example.
  • To install all the dependencies for the module, run the following command in the R console:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("itsleeds/tds")

See Section 1.5 of the online guide Reproducible Road Safety Research with R for instructions on how to install key packages we will use in the module (see also footnote 1).

2.2.2 Python

If you choose to use Python, you should be able to install it and manage your own Python environment, including installing packages and dealing with package conflicts. If you use Python, we recommend using pixi, which can manage both R and Python environments.
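As a quick check that a Python environment is ready, the following minimal sketch reports which packages can be found by the interpreter. The package list and the helper name find_missing are assumptions for illustration (geopandas is the only Python package named in this guide), so adjust the list to match the packages you actually need.

```python
import importlib.util
import sys

# Hypothetical package list for illustration; only geopandas is named in this guide.
REQUIRED_PACKAGES = ["geopandas", "pandas", "matplotlib"]


def find_missing(packages):
    """Return the names in `packages` that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Report the interpreter version, then any packages that still need installing.
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = find_missing(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are available.")
```

Running the script inside the environment you plan to use for the module shows whether anything still needs to be installed before the practicals.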

2.3 Command-line experience

You should be comfortable with computing in general, for example creating folders, moving files, and installing software. You should be comfortable with using command line interfaces such as PowerShell in Windows, Terminal in macOS, or the Linux shell.

2.4 Data science experience prerequisites

Prior experience of using R or Python (e.g. having used it for work, in previous degrees or having completed an online course) is essential.

Students can demonstrate this by showing evidence that they have worked with R before or have completed an online course, such as the first 4 sessions in the RStudio Primers series or DataCamp’s Free Introduction to R course.

Evidence of substantial programming and data science experience in previous professional or academic work, in languages such as R or Python, also constitutes sufficient prerequisite knowledge for the course.

Footnotes

  1. For further guidance on setting up your computer to run R and RStudio for spatial data, we recommend Chapter 2 of Geocomputation with R (https://geocompr.robinlovelace.net/spatial-class.html; the Prerequisites section contains links for installing spatial software on Mac, Linux and Windows) and Chapter 2 of the online book Efficient R Programming, particularly sections 2.3 and 2.5 for details on R installation and set-up, and the section on project management.