Accessing data from the Internet


🗺
Transport Data Science

Robin Lovelace

Objectives

  • Learn where to find large transport datasets and assess data quality

Learning outcomes

  • Identify available datasets and access and clean them

This lecture will…

  • Be primarily practical
  • Provide an overview of data access options
  • Show how R packages and web services provide access to some datasets

Data access in context

  • It’s important to have an idea where you’re heading with the analysis

  • Often best to start with pen and paper

Data access/cleaning vs modelling time

Source: https://twitter.com/jontapson/status/1103024752019402753

A typology of data sources

Information and data pyramids

Data science is climbing the DIKW pyramid

A geographic availability pyramid

  • Recommendations
  • Build this here!
  • City-specific datasets
    • Bristol cycle count data
  • Hard-to-access national data
  • Open international/national datasets
    • Open origin-destination data from UK Census
  • Globally available, low-grade data (bottom)
    • OpenStreetMap, elevation data

An ease-of-access pyramid

  • Data provision packages
    • The pct package
    • The stats19 package
  • Pre-processed data
    • E.g. downloading data from the website www.pct.bike
  • Messy official data
    • Raw STATS19 data

A geographic level of detail pyramid

  • Agents
  • Route networks
  • Nodes
  • Routes
  • Desire lines
  • Transport zones
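
To make the lower levels of this pyramid concrete, the sketch below turns zone-to-zone origin-destination counts into desire lines, i.e. straight lines between zone centroids. It is a minimal illustration rather than the course's own method: the zone codes, centroid coordinates, and trip counts are hypothetical, and Python is used here only for a self-contained example.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical zone centroids (zone code -> (lon, lat)); not real data
centroids = {
    "Z1": (-1.55, 53.80),
    "Z2": (-1.50, 53.82),
    "Z3": (-1.60, 53.78),
}

# Hypothetical origin-destination table: (origin, destination, trips)
od_pairs = [("Z1", "Z2", 120), ("Z1", "Z3", 45), ("Z2", "Z3", 10)]


def haversine_km(p1, p2):
    """Great-circle distance in km between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def make_desire_lines(od_pairs, centroids):
    """Join each OD pair to its two zone centroids to form a straight line."""
    lines = []
    for origin, destination, trips in od_pairs:
        start, end = centroids[origin], centroids[destination]
        lines.append(
            {
                "origin": origin,
                "destination": destination,
                "trips": trips,
                "coords": [start, end],
                "length_km": haversine_km(start, end),
            }
        )
    return lines


desire_lines = make_desire_lines(od_pairs, centroids)
for line in desire_lines:
    print(line["origin"], line["destination"], line["trips"], round(line["length_km"], 2))
```

Routes and route networks sit higher in the pyramid because they need a routing step on top of these straight lines.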

Observations

  • Official sources are often smaller in size but higher in quality

  • Unofficial sources provide higher volumes but tend to be noisy

  • Another way to classify data is by quality: signal/noise ratios

  • Globally available datasets would be at the bottom of this pyramid; local surveys at the top.

  • Which would be best to inform policy?

Portals

Online lists

For other datasets, search online! Good starting points in your research may be:

Data packages

Practical demo

See the practical session at itsleeds.github.io/tds/p2/, which involves:

  • Getting data from OSM: overpass turbo
  • Data from stats19
  • Data from the Census
  • Bonus: getting data from the Cadence platform
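
As a taste of the OSM step, the sketch below builds the kind of Overpass QL query that can be pasted into overpass turbo. It is an illustrative example, not the practical's own code: the helper name `build_overpass_query`, the choice of the `highway=cycleway` tag, and the approximate bounding box around central Leeds are all assumptions. No network request is made.

```python
def build_overpass_query(key, value, bbox, timeout=25):
    """Return an Overpass QL query for ways with a given tag inside a bounding box.

    bbox is (south, west, north, east) in decimal degrees, the order Overpass QL expects.
    """
    south, west, north, east = bbox
    if not (south < north and west < east):
        raise ValueError("bbox must be (south, west, north, east)")
    return (
        f"[out:json][timeout:{timeout}];\n"
        f'way["{key}"="{value}"]({south},{west},{north},{east});\n'
        "out geom;"
    )


# Approximate, illustrative bounding box around central Leeds (an assumption)
LEEDS_BBOX = (53.78, -1.58, 53.82, -1.51)

query = build_overpass_query("highway", "cycleway", LEEDS_BBOX)
print(query)
```

The printed query can be run in the overpass turbo web interface, which can then export the result for further analysis.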

References

Lovelace, Robin, Anna Goodman, Rachel Aldred, Nikolai Berkoff, Ali Abbas, and James Woodcock. 2017. “The Propensity to Cycle Tool: An Open Source Online System for Sustainable Transport Planning.” Journal of Transport and Land Use 10 (1). https://doi.org/10.5198/jtlu.2016.862.
Lovelace, Robin, Malcolm Morgan, Layik Hama, and Mark Padgham. 2019. “Stats19: A Package for Working with Open Road Crash Data.” Journal of Open Source Software. https://doi.org/10/gkb498.
Wickham, Hadley, Mine Cetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly Media. https://r4ds.hadley.nz/.