This is some text:
casualty_type = c("pedestrian", "cyclist", "cat")
casualty_age = seq(from = 20, to = 60, by = 20)

See openstreetmap.org or search for other open access datasets for more ideas
Review of homework
Create a new folder (or R project with RStudio) called ‘session1’
In it, create a file called foundations.qmd
Type the following:
This is some text:
casualty_type = c("pedestrian", "cyclist", "cat")
casualty_age = seq(from = 20, to = 60, by = 20)
Render the document by running quarto render foundations.qmd in the PowerShell or Terminal console. The result should look like this (see Figure 1):
The document you are reading is a Quarto document, so we can show the output of the code in foundations.qmd here:
casualty_type = c("pedestrian", "cyclist", "cat")
casualty_age = seq(from = 20, to = 60, by = 20)
crashes = data.frame(casualty_type, casualty_age)
nrow(crashes)
[1] 3
There are different ways to execute code. When you run the code by ‘compiling’ the document, the objects are created in a different session. For interactive data analysis, it is better to run the code in the console.
Make sure that you also execute the code in the console so that the objects are created in memory. Do that by placing the cursor in the code chunk and pressing Ctrl+Enter (or Cmd+Enter on a Mac) or by copying and pasting the code into the console (not recommended). We now have a data frame object stored in memory (technically in the global environment) that is used as the basis of the questions.
Next, extend the data frame you created by adding the following code to the code chunk in the .qmd file:
vehicle_type = c("car", "bus", "tank")
crashes$vehicle_type = vehicle_type
What just happened?
We will explore this together, and then you can try the following data manipulation exercises:
Use the $ operator to print the vehicle_type column of crashes.
In R the $ symbol is used to refer to elements of a list. So the answer is simply:
crashes$vehicle_type
[1] "car"  "bus"  "tank"
Subset crashes with the [,] syntax.
Try out different combinations on the data frame crashes to see what happens. For example, try:
crashes[1,]
  casualty_type casualty_age vehicle_type
1    pedestrian           20          car
crashes[,1]
[1] "pedestrian" "cyclist"    "cat"
crashes[1,1]
[1] "pedestrian"
Subset crashes with the [[ syntax.
The [[ operator is used to extract elements from a list. Try:
crashes[[1]]
[1] "pedestrian" "cyclist"    "cat"
crashes[[2]]
[1] 20 40 60
What is the class() of the objects created by each of the previous exercises?
… pandas or polars package
Work through the following example on road traffic data (recommended for most people) or the NTS data (for people more interested in travel survey data). You can do both if you have time.
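For readers who want to try the crashes exercises in Python, here is a minimal sketch using pandas. It assumes pandas is installed, and the variable names other than those in the R code (n_rows, vehicle_col, first_row, first_cell, ages) are illustrative. Note that Python counts positions from 0, whereas R counts from 1.

```python
import pandas as pd

# Same toy data as the R example
casualty_type = ["pedestrian", "cyclist", "cat"]
casualty_age = list(range(20, 61, 20))  # 20, 40, 60, like seq(from = 20, to = 60, by = 20)
crashes = pd.DataFrame(
    {"casualty_type": casualty_type, "casualty_age": casualty_age}
)

# Add a column, as crashes$vehicle_type = vehicle_type does in R
crashes["vehicle_type"] = ["car", "bus", "tank"]

n_rows = len(crashes)                    # roughly nrow(crashes)
vehicle_col = crashes["vehicle_type"]    # roughly crashes$vehicle_type (a Series)
first_row = crashes.iloc[[0]]            # roughly crashes[1, ] (a one-row DataFrame)
first_cell = crashes.iloc[0, 0]          # roughly crashes[1, 1]
ages = crashes["casualty_age"].tolist()  # roughly crashes[[2]], as a plain list

print(n_rows)
print(vehicle_col.tolist())
# The Python counterpart of asking for class() in R
print(type(vehicle_col).__name__, type(first_row).__name__)
```

Selecting with a single label returns a one-dimensional Series, while selecting with a list of positions keeps the result as a DataFrame. This is broadly comparable to the difference between extracting one column and subsetting rows in R.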
To get some larger datasets, try the following (from Chapter 8 of RSRR):
remotes::install_cran("stats19")
library(stats19)
ac = get_stats19(year = 2020, type = "collision")
ca = get_stats19(year = 2020, type = "cas")
ve = get_stats19(year = 2020, type = "veh")
# percentage of the population hurt by road traffic collisions in 2020 (population taken as 67 million):
(nrow(ca) / 67e6) * 100
[1] 0.1725134
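The final line of the R code is plain arithmetic: the number of casualty records divided by the population, times 100. As a starting point for the Python challenge, that step can be sketched on its own. The function name percent_of_population is hypothetical, and the row count below is not downloaded: it is back-calculated from the printed result of 0.1725134, so treat it as an assumption.

```python
def percent_of_population(n_casualties, population=67e6):
    """Return the number of casualties as a percentage of the population."""
    return (n_casualties / population) * 100


# Assumed row count, back-calculated from the R output (not read from the data):
# 115,584 casualty records would reproduce the printed percentage.
assumed_rows = 115_584
pct = percent_of_population(assumed_rows)
print(round(pct, 7))
```

With real data, assumed_rows would be replaced by the number of rows in the casualty table.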
Challenge: reproduce the above code in Python using the pystats19 package
# Install the package, e.g. with pip
!pip install pystats19
import pystats19
# See the documentation at https://github.com/Mayazure/py-stats19
Let’s go through these exercises together:
Subset the casualty_age object using the inequality (<) so that only elements less than 50 are returned.
Subset the crashes data frame so that only tanks are returned using the == operator.
… ac object created previously
Coerce the vehicle_type column of crashes to the class character.
Coerce the crashes object into a matrix. What happened to the values?
What is the difference between the output of summary() on character and factor variables?
Note: you will need to download the modified NTS 2022 data from your Minerva module page and place it in your working directory for this section to work.
# Read CSV file
NTS_data <- read.csv("NTS2022_modifieddata.csv")
# Look at the column names
names(NTS_data)
# Look at the data
head(NTS_data)
You should see something like this:
> names(NTS_data)
[1] "IndividualID" "avg_trip_length"
[3] "avg_trip_length_weekday" "avg_trip_length_weekend"
[5] "total_distance" "total_distance_weekday"
[7] "total_distance_weekend" "SD_triplength"
[9] "sd_Total_Distance_wknd" "sd_Total_Distance_wk"
> head(NTS_data)
  IndividualID avg_trip_length avg_trip_length_weekday avg_trip_length_weekend
1   2023000001        4.080000                4.631579                2.333333
2   2023000002        2.538462                2.400000                3.000000
3   2023000003        5.916667                6.250000                5.250000
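The same read-and-inspect steps can be sketched in Python. This minimal example uses only the standard library and, so that it runs without the Minerva file, parses the three rows and four columns printed in the head() output from an in-memory string. The names sample_csv, column_names and rows are illustrative.

```python
import csv
import io

# Three rows copied from the head() output; only the first four columns are shown there
sample_csv = """IndividualID,avg_trip_length,avg_trip_length_weekday,avg_trip_length_weekend
2023000001,4.080000,4.631579,2.333333
2023000002,2.538462,2.400000,3.000000
2023000003,5.916667,6.250000,5.250000
"""

reader = csv.DictReader(io.StringIO(sample_csv))
column_names = reader.fieldnames  # roughly names(NTS_data)
rows = list(reader)               # each row is a dict of strings

# Convert the numeric columns from text to float (the ID is left as text)
for row in rows:
    for name in column_names[1:]:
        row[name] = float(row[name])

print(column_names)
print(rows[0])  # roughly the first line of head(NTS_data)
```

To read the real file instead, the io.StringIO(sample_csv) argument could be swapped for open("NTS2022_modifieddata.csv", newline=""), assuming the file is in the working directory.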
Visualising datasets is important when dealing with large volumes of data, as visualisations help convey complex information in an easily interpretable format. Consider the histogram plots of average trip lengths over a week in the UK.
# Note: This requires ggplot2 library to be loaded first
library(tidyverse) # Tidyverse contains ggplot2 and other useful packages
ggplot(NTS_data, aes(x = avg_trip_length)) +
  geom_histogram(binwidth = 1, fill = "darkgrey") +
  labs(
    title = "Avg. Trip Length in Whole Week",
    x = "Trip Length (km)",
    y = "Number of Individuals"
  ) +
  theme_minimal() +
  xlim(0, 50)
Data exploration or “exploratory data analysis” (EDA) involves examining datasets in depth to uncover underlying patterns or differences. The direction of this investigation is largely guided by the research question.
Consider separate histograms for weekdays and weekends. Can you identify any differences between them? (Clue: check the number of individuals between 0 and 1 km)
Think: What could be plausible reasons for such a difference?
ggplot(NTS_data, aes(x = avg_trip_length_weekday)) +
  geom_histogram(binwidth = 1, fill = "darkblue") +
  labs(
    title = "Avg. Trip Length on Weekdays",
    x = "Trip Length (km)",
    y = "Number of Individuals"
  ) +
  theme_minimal() +
  xlim(0, 50)
ggplot(NTS_data, aes(x = avg_trip_length_weekend)) +
  geom_histogram(binwidth = 1, fill = "darkred") +
  labs(
    title = "Avg. Trip Length on Weekends",
    x = "Trip Length (km)",
    y = "Number of Individuals"
  ) +
  theme_minimal() +
  xlim(0, 50)
You can more easily compare the two histograms when they are placed in the same plot, with transparency added to the bars:
g_combined = ggplot() +
  geom_histogram(data = NTS_data, aes(x = avg_trip_length_weekday),
                 binwidth = 1, fill = "darkblue", alpha = 0.5) +
  geom_histogram(data = NTS_data, aes(x = avg_trip_length_weekend),
                 binwidth = 1, fill = "darkred", alpha = 0.5) +
  labs(
    title = "Avg. Trip Length on Weekdays (blue) and Weekends (red)",
    x = "Trip Length (km)",
    y = "Number of Individuals"
  ) +
  theme_minimal() +
  xlim(0, 50)
# Then 'print' the plot to show it:
g_combined
You can save the plot with ggsave():
ggsave("avg_trip_length_weekday_weekend.png", plot = g_combined, width = 8, height = 6)
In a Quarto document (.qmd file), which you will use to write and submit your coursework, you can include the saved figure with Markdown image syntax, for example ![](avg_trip_length_weekday_weekend.png) (we will come onto this later in the module).
Don’t they largely look the same? Can you stop here and infer that the trip length distributions for weekdays and weekends are largely similar? You might, depending on the resources at your disposal, but from an academic point of view we need to think about other potential dimensions where they could be different.
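One such dimension is variability: two people can have the same average trip length while the spread of their trips differs a lot. The following minimal Python sketch illustrates this with made-up trip lengths (hypothetical numbers, not NTS data), using the sample standard deviation from the standard library.

```python
import statistics

# Hypothetical trip lengths in km for two people (made-up numbers, not NTS data)
person_a = [4.0, 5.0, 6.0, 5.0]   # similar trips every time
person_b = [0.5, 1.0, 17.5, 1.0]  # mostly short trips plus one long one

mean_a = statistics.mean(person_a)
mean_b = statistics.mean(person_b)
sd_a = statistics.stdev(person_a)  # sample standard deviation, like sd() in R
sd_b = statistics.stdev(person_b)

print(mean_a, mean_b)                # identical averages
print(round(sd_a, 2), round(sd_b, 2))  # very different spreads
```

Both people average 5 km per trip, so a histogram of averages would place them in the same bar, yet the second person's standard deviation is roughly ten times larger.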
Consider histograms of the standard deviation (SD) of trip lengths on weekdays and weekends. Can you identify any differences between them? (Clue: again, check the number of individuals with an SD of 0-2 km)
Think: What could be plausible reasons for such a difference?
ggplot(NTS_data, aes(x = sd_Total_Distance_wk)) +
  geom_histogram(binwidth = 0.5, fill = "darkblue") +
  labs(
    title = "SD of Trip Length on Weekdays",
    x = "SD of trip length (km)",
    y = "Number of Individuals"
  ) +
  theme_minimal() +
  xlim(0, 25) + ylim(0, 1000)
ggplot(NTS_data, aes(x = sd_Total_Distance_wknd)) +
  geom_histogram(binwidth = 0.5, fill = "darkred") +
  labs(
    title = "SD of Trip Length on Weekends",
    x = "SD of trip length (km)",
    y = "Number of Individuals"
  ) +
  theme_minimal() +
  xlim(0, 25) + ylim(0, 1000)
Read and try to complete the exercises in Chapters 1 to 5 of the book Reproducible Road Safety Research with R.
It assumes that you have recently updated R and RStudio on your computer. For details on installing packages see here.
Reproduce the code I wrote during this session, e.g. by copy-pasting this code into the console or source editor of RStudio: github.com/itsleeds/tds/blob/main/s1/s1project/foundations.qmd
Work through Chapter 13 of the Geocomputation with R book. Make notes in a .qmd file that you can bring to the class to show colleagues and the instructor next week.
Think of a research question that you could answer with data science, and write it down in a .qmd file. Include a sketch of the data you would need to answer the question.
Sign up to the Cadence platform as outlined at itsleeds.github.io/tds/s2/#the-cadence-platform