Learn how to load, explore, filter, transform, and summarise data
CSV files are a common format for storing tabular data. You can read them into Python using the pd.read_csv() function from the pandas package.
Understand your data before analysis
The first step in exploring a dataset is to understand its structure and size.
You can quickly inspect the first few rows of your dataset to get a sense of the data.
For a more detailed look at the data, you can use data viewers in VS Code or Jupyter notebooks.
Summary statistics help you understand the distribution and key characteristics of your data.
Selecting specific rows based on conditions is a common task in data manipulation.
You may want to choose which variables to work with
Sort your data by column values
Add or modify variables using pandas
Calculate aggregate statistics
Calculate statistics for each group
Chain multiple operations together
import pandas as pd
# Complex workflow
result = (crashes
[crashes['casualty_age'] > 18]
[['casualty_type', 'casualty_age', 'vehicle_type', 'dark']]
.assign(age_group=lambda x: np.where(x['casualty_age'] > 50, 'Older', 'Younger'))
.groupby(['vehicle_type', 'age_group'])
.agg(n_casualties=('casualty_age', 'count'))
.reset_index()
.sort_values('n_casualties', ascending=False)
)Save your processed data to a new CSV file
Bringing it all together
Breaking down complex workflows into steps with meaningful variable names makes code more readable and maintainable.
import pandas as pd
# Step 1: Filter for dark conditions
crashes_dark = crashes[crashes['dark'] == True]
# Step 2: Select relevant columns
crashes_dark_subset = crashes_dark[['casualty_type', 'casualty_age', 'vehicle_type']]
# Step 3: Create age categories
crashes_with_ages = crashes_dark_subset.copy()
crashes_with_ages['age_category'] = pd.cut(crashes_with_ages['casualty_age'],
bins=[0, 25, 65, 150],
labels=['Young', 'Middle-aged', 'Older'])
# Step 4: Group and summarise
summary_by_type = crashes_with_ages.groupby(['casualty_type', 'age_category']).agg({
'casualty_age': ['count', 'mean']
}).reset_index()
summary_by_type.columns = ['casualty_type', 'age_category', 'n_casualties', 'mean_age']
# Step 5: Sort results
final_summary = summary_by_type.sort_values('n_casualties', ascending=False)
# Step 6: Export results
final_summary.to_csv('outputs/dark_condition_summary.csv', index=False)Good practice: Use descriptive variable names that show the transformation at each step!
✅ Load and explore data before analysis
✅ Filter rows based on conditions
✅ Select and rename columns strategically
✅ Create new variables with meaningful names
✅ Group and summarise data for insights
✅ Chain operations with method chaining
✅ Break complex workflows into named steps
✅ Export processed data for sharing
If you’re interested in R, here are equivalent concepts
CSV files are a common format for storing tabular data. You can read them into R using the read.csv() function or the read_csv() function from the readr package.
Understand your data structure and contents
# View dimensions
dim(crashes) # rows and columns
nrow(crashes) # number of rows
ncol(crashes) # number of columns
# View structure
str(crashes) # structure of each column
# Quick overview
head(crashes) # first 6 rows (default)
head(crashes, n = 10) # first 10 rows
# Glimpse function from dplyr
library(dplyr)
glimpse(crashes) # compact overviewSelecting specific rows based on conditions is a common task in data manipulation.
You may want to choose which variables to work with
Sort your data by column values
Add or modify variables
Calculate aggregate statistics
Calculate statistics for each group
Chain multiple operations together
library(dplyr)
# Complex workflow
result <- crashes |>
filter(casualty_age > 18) |>
select(casualty_type, casualty_age, vehicle_type, dark) |>
mutate(age_group = ifelse(casualty_age > 50, "Older", "Younger")) |>
group_by(vehicle_type, age_group) |>
summarise(n_casualties = n(), .groups = "drop") |>
arrange(desc(n_casualties))Save your processed data to a new CSV file
Bringing it all together
Breaking down complex workflows into steps with meaningful variable names makes code more readable and maintainable.
library(dplyr)
library(readr)
# Step 1: Filter for dark conditions
crashes_dark <- crashes |>
filter(dark == TRUE)
# Step 2: Select relevant columns
crashes_dark_subset <- crashes_dark |>
select(casualty_type, casualty_age, vehicle_type)
# Step 3: Create age categories
crashes_with_ages <- crashes_dark_subset |>
mutate(age_category = case_when(
casualty_age < 25 ~ "Young",
casualty_age < 65 ~ "Middle-aged",
TRUE ~ "Older"
))
# Step 4: Group and summarise
summary_by_type <- crashes_with_ages |>
group_by(casualty_type, age_category) |>
summarise(
n_casualties = n(),
mean_age = mean(casualty_age),
.groups = "drop"
)
# Step 5: Order results
final_summary <- summary_by_type |>
arrange(desc(n_casualties))
final_summary
# Step 6: Export results
write_csv(final_summary, "outputs/dark_condition_summary.csv")Good practice: Use descriptive variable names that show the transformation at each step!