Learn how to load, explore, filter, transform, and summarise data
CSV files are a common format for storing tabular data. You can read them into R using the read.csv() function or the read_csv() function from the readr package.
Understand your data before analysis
The first step in exploring a dataset is to understand its structure and size.
You can quickly inspect the first few rows of your dataset to get a sense of the data.
For a more detailed look at the data, you can use the View() function in RStudio, which opens the dataset in a spreadsheet-like viewer.
Summary statistics help you understand the distribution and key characteristics of your data.
Selecting specific rows based on conditions is a common task in data manipulation.
You may want to choose which variables to work with
Sort your data by column values
Add or modify variables
Calculate aggregate statistics
Calculate statistics for each group
Chain multiple operations together
library(dplyr)
# Complex workflow
result <- crashes |>
filter(casualty_age > 18) |>
select(casualty_type, casualty_age, vehicle_type, dark) |>
mutate(age_group = ifelse(casualty_age > 50, "Older", "Younger")) |>
group_by(vehicle_type, age_group) |>
summarise(n_casualties = n(), .groups = "drop") |>
arrange(desc(n_casualties))Save your processed data to a new CSV file
Bringing it all together
Breaking down complex workflows into steps with meaningful variable names makes code more readable and maintainable.
library(dplyr)
library(readr)
# Step 1: Filter for dark conditions
crashes_dark <- crashes |>
filter(dark == TRUE)
# Step 2: Select relevant columns
crashes_dark_subset <- crashes_dark |>
select(casualty_type, casualty_age, vehicle_type)
# Step 3: Create age categories
crashes_with_ages <- crashes_dark_subset |>
mutate(age_category = case_when(
casualty_age < 25 ~ "Young",
casualty_age < 65 ~ "Middle-aged",
TRUE ~ "Older"
))
# Step 4: Group and summarise
summary_by_type <- crashes_with_ages |>
group_by(casualty_type, age_category) |>
summarise(
n_casualties = n(),
mean_age = mean(casualty_age),
.groups = "drop"
)
# Step 5: Order results
final_summary <- summary_by_type |>
arrange(desc(n_casualties))
final_summary
# Step 6: Export results
write_csv(final_summary, "outputs/dark_condition_summary.csv")Good practice: Use descriptive variable names that show the transformation at each step!
✅ Load and explore data before analysis
✅ Filter rows based on conditions
✅ Select and rename columns strategically
✅ Create new variables with meaningful names
✅ Group and summarise data for insights
✅ Chain operations with the pipe operator (|>)
✅ Break complex workflows into named steps
✅ Export processed data for sharing
If you’re interested in Python, here are equivalent concepts
CSV files are read similarly in Python using pandas
Understand your data structure and contents
import pandas as pd
# View dimensions
crashes.shape # returns (rows, columns)
len(crashes) # number of rows
len(crashes.columns) # number of columns
# View structure
crashes.info() # structure of each column
crashes.head() # first 6 rows (default)
crashes.head(10) # first 10 rows
crashes.describe() # summary statisticsSelect specific rows based on conditions
Choose which variables to work with
Sort your data by column values
Add or modify variables using pandas
Calculate aggregate statistics
Calculate statistics for each group
Bringing it all together
import pandas as pd
# Step 1: Filter for dark conditions
crashes_dark = crashes[crashes['dark'] == True]
# Step 2: Select relevant columns
crashes_dark_subset = crashes_dark[['casualty_type', 'casualty_age', 'vehicle_type']]
# Step 3: Create age categories
crashes_with_ages = crashes_dark_subset.copy()
crashes_with_ages['age_category'] = pd.cut(crashes_with_ages['casualty_age'],
bins=[0, 25, 65, 150],
labels=['Young', 'Middle-aged', 'Older'])
# Step 4: Group and summarise
summary_by_type = crashes_with_ages.groupby(['casualty_type', 'age_category']).agg({
'casualty_age': ['count', 'mean']
}).reset_index()
summary_by_type.columns = ['casualty_type', 'age_category', 'n_casualties', 'mean_age']
# Step 5: Sort results
final_summary = summary_by_type.sort_values('n_casualties', ascending=False)
# Step 6: Export results
final_summary.to_csv('outputs/dark_condition_summary.csv', index=False)