Accessing data from the Internet

🗺
Transport Data Science

Robin Lovelace

Objectives

Learn where to find large transport datasets and assess data quality

Learning outcomes

Identify available datasets, then access and clean them

This lecture will…

Be primarily practical
Provide an overview of data access options
Show how R packages and web services provide access to some datasets

Data access in context

Data cleaning (or ‘tidying’ or ‘wrangling’) is part of a wider process (Wickham, Cetinkaya-Rundel, and Grolemund 2023)

It’s important to have an idea where you’re heading with the analysis
Often best to start with pen and paper

Data access/cleaning vs modelling time

Tapson’s Rules of Machine Learning:
4. Time spent on data cleaning is an order of magnitude more productive than time spent on hyperparameter tuning.

(Extreme example: achieved a Top 10 result in Kaggle using linear regression, as the only team that cleaned 50/60Hz noise first.)

Jonathan Tapson (@jontapson), March 5, 2019

Source: https://twitter.com/jontapson/status/1103024752019402753

A typology of data sources

Information and data pyramids

Data science is climbing the DIKW pyramid

A geographic availability pyramid

City-specific datasets
- Bristol cycle count data
Hard-to-access national data
Open international/national datasets
- Open origin-destination data from UK Census
Globally available, low-grade data (bottom)
- OpenStreetMap, Elevation data

An ease-of-access pyramid

Data provision packages
- The pct package
- The stats19 package
Pre-processed data
- E.g. data downloaded from the www.pct.bike website
Messy official data
- Raw STATS19 data

A geographic level of detail pyramid

Agents
Route networks
Nodes
Routes
Desire lines
Transport zones

Observations

Official sources are often smaller in size but higher in quality
Unofficial sources provide higher volumes but tend to be noisy
Another way to classify data is by quality: signal/noise ratios
Globally available datasets would be at the bottom of this pyramid; local surveys at the top.
Which would be best to inform policy?

Portals

UK geoportal, providing geographic data at many levels
Other national geoportals exist
A good source of cleaned origin-destination data is the Region downloads tab in the Propensity to Cycle Tool (www.pct.bike); see the Region data tab for West Yorkshire, for example
OpenStreetMap is an excellent source of geographic data with global coverage. You can download data on specific queries (e.g. highway=cycleway) from the overpass-turbo service or with the osmdata or osmextract packages
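
As a minimal sketch of the osmdata route, the query below asks for ways tagged highway=cycleway. The bounding box is an assumed, approximate box around central Leeds, chosen only for illustration, and the download step is left commented out because it needs internet access:

```r
library(osmdata)

# Bounding box as c(xmin, ymin, xmax, ymax); these coordinates are an
# assumption (roughly central Leeds), used only for illustration
bb = c(-1.60, 53.78, -1.50, 53.82)

# Build the Overpass query for features tagged highway=cycleway
# (nothing is downloaded at this stage)
q = opq(bbox = bb) |>
  add_osm_feature(key = "highway", value = "cycleway")

# Print the Overpass QL text that would be sent to the server
cat(opq_string(q))

# Uncomment to send the query and get the results as sf objects:
# cycleways = osmdata_sf(q)
# plot(sf::st_geometry(cycleways$osm_lines))
```

For larger areas osmextract is usually the better fit, because it works from pre-built regional extracts rather than live Overpass queries.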

Online lists

For other datasets, search online! Good starting points in your research may be:

The open data section in Geocomputation with R (r.geocompx.org/read-write)
Transport datasets listed on data.world
UK government transport data: Department for Transport

Data packages

The openrouteservice package (available from GitHub) provides routing data
The stats19 package can get road crash data for anywhere in Great Britain (Lovelace et al. 2019); see docs.ropensci.org/stats19
The pct package provides access to data in the PCT project, including origin-destination data for the UK (Lovelace et al. 2017); see github.com/ITSLeeds/pct
There are many other R packages to help access data, including the spanishoddata package for Spanish origin-destination data
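
The sketch below shows roughly what getting data from two of these packages can look like. The wrapper `get_demo_data()` and its default arguments are hypothetical, introduced only for illustration; `get_stats19()`, `get_pct_zones()` and `get_pct_lines()` are functions from the stats19 and pct packages. Calling the wrapper downloads data, so it needs internet access:

```r
# Hypothetical helper (not part of either package) bundling three downloads
get_demo_data = function(year = 2022, region = "west-yorkshire") {
  list(
    # Road crash records for one year, from the stats19 package
    crashes = stats19::get_stats19(year = year, type = "collision"),
    # Zone boundaries and origin-destination desire lines, from the pct package
    zones = pct::get_pct_zones(region = region),
    lines = pct::get_pct_lines(region = region)
  )
}

# Uncomment to run (downloads data on first use):
# demo = get_demo_data()
# names(demo)
```

Both packages return data that is already cleaned to some degree, which is what places them at the top of the ease-of-access pyramid.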

Practical demo

See practical session at itsleeds.github.io/tds/p2/

That involves:
Getting data from OSM: overpass turbo
Data from stats19
Data from the Census
Bonus: getting data from Cadence platform

References

Lovelace, Robin, Anna Goodman, Rachel Aldred, Nikolai Berkoff, Ali Abbas, and James Woodcock. 2017. “The Propensity to Cycle Tool: An Open Source Online System for Sustainable Transport Planning.” Journal of Transport and Land Use 10 (1). https://doi.org/10.5198/jtlu.2016.862.

Lovelace, Robin, Malcolm Morgan, Layik Hama, and Mark Padgham. 2019. “Stats19: A Package for Working with Open Road Crash Data.” Journal of Open Source Software 4 (33). https://doi.org/10.21105/joss.01181.

Wickham, Hadley, Mine Cetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. O’Reilly Media. https://r4ds.hadley.nz/.