🗺
Transport Data Science
It’s important to have an idea where you’re heading with the analysis
Often best to start with pen and paper
Tapson’s Rules of Machine Learning:
4. Time spent on data cleaning is an order of magnitude more productive than time spent on hyperparameter tuning.
(Extreme example: achieved a Top 10 result in Kaggle using linear regression, as the only team that cleaned 50/60Hz noise first.) Jonathan Tapson (@jontapson), March 5, 2019
Source: https://twitter.com/jontapson/status/1103024752019402753
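As a rough illustration of the kind of cleaning mentioned in the tweet, here is a minimal sketch that removes mains hum by fitting a sinusoid at a known frequency and subtracting it. This is not Tapson's method, which the tweet does not describe. The function name `remove_mains_hum`, the sampling rate, and the synthetic signal are all assumptions made for the example.

```python
import math


def remove_mains_hum(samples, sample_rate_hz, hum_hz=50.0):
    """Subtract a least-squares fit of a sinusoid at hum_hz from samples.

    Hypothetical helper: assumes the interference frequency is known
    and constant over the whole record.
    """
    w = 2.0 * math.pi * hum_hz / sample_rate_hz
    s = [math.sin(w * n) for n in range(len(samples))]
    c = [math.cos(w * n) for n in range(len(samples))]

    # Normal equations for the model: samples ~ a * sin + b * cos
    ss = sum(u * u for u in s)
    cc = sum(v * v for v in c)
    sc = sum(u * v for u, v in zip(s, c))
    xs = sum(x * u for x, u in zip(samples, s))
    xc = sum(x * v for x, v in zip(samples, c))

    det = ss * cc - sc * sc
    if abs(det) < 1e-12:
        raise ValueError("cannot fit a sinusoid at this frequency")
    a = (xs * cc - xc * sc) / det
    b = (xc * ss - xs * sc) / det

    return [x - a * u - b * v for x, u, v in zip(samples, s, c)]


# Synthetic one-second record (illustrative values only):
# a slow 2 Hz signal with an offset, plus 50 Hz hum.
fs = 1000
clean = [5.0 + math.sin(2 * math.pi * 2 * n / fs) for n in range(fs)]
hum = [0.5 * math.sin(2 * math.pi * 50 * n / fs + 0.3) for n in range(fs)]
noisy = [x + h for x, h in zip(clean, hum)]

cleaned = remove_mains_hum(noisy, fs, hum_hz=50.0)
max_error = max(abs(x - y) for x, y in zip(cleaned, clean))
```

In practice a notch filter from a signal-processing library would be the more usual choice. The point is that a simple, targeted cleaning step can matter more than model tuning.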
Data science is climbing the DIKW pyramid
Recommendations
Build this here!
City-specific datasets
Hard-to-access national data
Open international/national datasets
Globally available, low-grade data (bottom)
Official sources are often smaller in size but higher in quality
Unofficial sources provide higher volumes but tend to be noisy
Another way to classify data is by quality: signal/noise ratios
Globally available datasets would be at the bottom of this pyramid; local surveys at the top.
Which would be best to inform policy?
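One way to make the signal/noise idea concrete is to compare a noisy source against a trusted reference and express the ratio in decibels. The sketch below assumes that a reference series exists (for example official counts at the same sites). The function name `snr_db` and the counts are hypothetical and only illustrate the calculation.

```python
import math


def snr_db(reference, observed):
    """Signal-to-noise ratio in dB, treating (observed - reference) as noise.

    Hypothetical helper: assumes both series measure the same quantity
    at the same locations and times.
    """
    if not reference or len(reference) != len(observed):
        raise ValueError("series must be non-empty and of equal length")

    n = len(reference)
    signal_power = sum(r * r for r in reference) / n
    noise_power = sum((o - r) ** 2 for r, o in zip(reference, observed)) / n

    if signal_power == 0:
        raise ValueError("reference signal has zero power")
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


# Illustrative (made-up) daily cycle counts at four sites
official_counts = [120, 135, 128, 140]
survey_counts = [118, 137, 127, 142]        # small local survey
crowdsourced_counts = [90, 160, 100, 175]   # high-volume, noisy source

survey_snr = snr_db(official_counts, survey_counts)
crowd_snr = snr_db(official_counts, crowdsourced_counts)
```

A higher value means the source tracks the reference more closely. It says nothing about coverage or volume, which is the other side of the trade-off.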
For other datasets, search online! Good starting points in your research may be:
See practical session at itsleeds.github.io/tds/p2/
That involves:
Getting data from OSM: overpass turbo
Data from stats19
Data from the Census
Bonus: getting data from Cadence platform
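The OSM step can also be scripted rather than done through the overpass turbo web interface. Below is a minimal sketch that builds an Overpass QL query for ways with a given tag inside a bounding box and prepares it for the public Overpass API endpoint. The helper names, the choice of the `highway=cycleway` tag, and the approximate Leeds bounding box are assumptions for illustration, not part of the practical.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Public Overpass API endpoint (other instances exist)
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_overpass_query(key, value, bbox, timeout_s=25):
    """Build an Overpass QL query for ways tagged key=value.

    bbox is (south, west, north, east) in decimal degrees,
    the order Overpass QL expects.
    """
    south, west, north, east = bbox
    if not (south < north and west < east):
        raise ValueError("bbox must be (south, west, north, east)")
    return (
        f"[out:json][timeout:{timeout_s}];"
        f'way["{key}"="{value}"]({south},{west},{north},{east});'
        "out geom;"
    )


def build_request_body(query):
    """URL-encode the query as the 'data' form field for a POST request."""
    return urlencode({"data": query}).encode("utf-8")


def fetch_overpass(query):
    """Send the query and return the parsed JSON (needs network access)."""
    body = build_request_body(query)
    with urllib.request.urlopen(OVERPASS_URL, data=body, timeout=60) as resp:
        return json.load(resp)


# Approximate bounding box around Leeds (illustrative values)
leeds_bbox = (53.7, -1.7, 53.9, -1.4)
query = build_overpass_query("highway", "cycleway", leeds_bbox)
# result = fetch_overpass(query)  # uncomment to run against the live API
```

The same query string can be pasted into overpass turbo to preview the result on a map before downloading.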