Statistics - Worked Solutions

This page presents the worked solutions for the statistics tasks, showing how to perform the same analyses in both R and Python.

Before you proceed…

These are worked solutions. We strongly recommend attempting to debug the R and Python problem scripts in the Tasks section before reviewing these solutions. Learning to debug is a crucial data science skill!

Task 1: Excel Data Analysis

This task involves reading data from an Excel file (ParkRunPerformanceData.xlsx), cleaning it, calculating summaries, and visualizing the distribution of run times.

Code

# Loading the necessary libraries
library(tidyverse)
library(readxl)

# Assigning the path to a variable
path_to_file <- "../00_data/ParkRunPerformanceData.xlsx" # Corrected path

# Reading data from an Excel file
data <- read_excel(path = path_to_file,
                   sheet = "Sheet1",
                   col_types = c("date", "numeric"))

# Exploring the data
cat("---\n--- Head of Data ---\n---")

---
--- Head of Data ---
---

Code

print(head(data))

# A tibble: 6 × 2
  `Event Date`        `Run Time (minutes)`
  <dttm>                             <dbl>
1 2015-08-22 00:00:00                 20.6
2 2015-08-15 00:00:00                 22.4
3 2015-08-08 00:00:00                 27.0
4 2015-08-01 00:00:00                 22.2
5 2015-07-25 00:00:00                 24.8
6 2015-07-18 00:00:00                 21.4

Code

cat("\n--- Structure of Data ---")


--- Structure of Data ---

Code

str(data)

tibble [120 × 2] (S3: tbl_df/tbl/data.frame)
 $ Event Date        : POSIXct[1:120], format: "2015-08-22" "2015-08-15" ...
 $ Run Time (minutes): num [1:120] 20.6 22.4 27 22.2 24.8 ...

Code

cat("\n--- Summary Statistics ---")


--- Summary Statistics ---

Code

summary(data)

   Event Date                  Run Time (minutes)
 Min.   :2012-08-04 00:00:00   Min.   :20.63     
 1st Qu.:2013-09-07 00:00:00   1st Qu.:21.78     
 Median :2014-04-08 12:00:00   Median :23.07     
 Mean   :2014-04-07 21:48:00   Mean   :23.42     
 3rd Qu.:2014-11-23 18:00:00   3rd Qu.:24.82     
 Max.   :2015-08-22 00:00:00   Max.   :31.13

Code

# Rename the columns to ensure consistency and ease manipulation
names(data) <- c("date","runtime")

# Sorting the data
data_sorted <- data |> arrange(runtime)
cat("\n--- Sorted Data (Ascending) ---")


--- Sorted Data (Ascending) ---

Code

print(head(data_sorted))

# A tibble: 6 × 2
  date                runtime
  <dttm>                <dbl>
1 2015-08-22 00:00:00    20.6
2 2014-09-27 00:00:00    20.7
3 2015-04-18 00:00:00    20.7
4 2015-04-25 00:00:00    20.8
5 2015-04-04 00:00:00    21.0
6 2014-11-29 00:00:00    21.0

Code

data_sorted_inv <- data |> arrange(-runtime)
cat("\n--- Sorted Data (Descending) ---")


--- Sorted Data (Descending) ---

Code

print(head(data_sorted_inv))

# A tibble: 6 × 2
  date                runtime
  <dttm>                <dbl>
1 2013-10-19 00:00:00    31.1
2 2013-07-20 00:00:00    30.6
3 2013-03-16 00:00:00    29.4
4 2014-02-01 00:00:00    27.5
5 2013-01-12 00:00:00    27.4
6 2013-06-15 00:00:00    27.3

Code

# Calculating the summaries manually 
summary_dates <- 
  data |>
  summarise(min_date = min(date),
            max_date = max(date))
cat("\n--- Summary Dates ---")


--- Summary Dates ---

Code

print(summary_dates)

# A tibble: 1 × 2
  min_date            max_date           
  <dttm>              <dttm>             
1 2012-08-04 00:00:00 2015-08-22 00:00:00

Code

summary_runtimes <- 
  data |>
  summarise(
            count = n(),
            mean_runtime = mean(runtime),
            slowest = max(runtime),
            fastest = min(runtime))
cat("\n--- Summary Runtimes ---")


--- Summary Runtimes ---

Code

print(summary_runtimes)

# A tibble: 1 × 4
  count mean_runtime slowest fastest
  <int>        <dbl>   <dbl>   <dbl>
1   120         23.4    31.1    20.6

Code

# Rounding the run times to the nearest minute
data_rounded <- data |> 
  mutate(runtime_mins = round(x = runtime, digits = 0))
cat("\n--- Data with Rounded Runtimes ---")


--- Data with Rounded Runtimes ---

Code

print(head(data_rounded))

# A tibble: 6 × 3
  date                runtime runtime_mins
  <dttm>                <dbl>        <dbl>
1 2015-08-22 00:00:00    20.6           21
2 2015-08-15 00:00:00    22.4           22
3 2015-08-08 00:00:00    27.0           27
4 2015-08-01 00:00:00    22.2           22
5 2015-07-25 00:00:00    24.8           25
6 2015-07-18 00:00:00    21.4           21

Code

# Counting the frequencies for each value
freq <- data_rounded |> 
  count(runtime_mins)
cat("\n--- Frequencies ---")


--- Frequencies ---

Code

print(freq)

# A tibble: 9 × 2
  runtime_mins     n
         <dbl> <int>
1           21    19
2           22    32
3           23    17
4           24    19
5           25    15
6           26    10
7           27     5
8           29     1
9           31     2

Code

# A quick histogram
hist(data_rounded$runtime_mins,breaks = 14:33, main = "Quick Histogram of Run Times")

Code

# A nicer histogram
data |> 
  ggplot(aes(x = runtime))+
  geom_histogram(binwidth = 1, col = "white",fill = "steelblue", alpha = 0.7)+
  labs (x = "Run time in seconds",
        y = "frequency",
        title = "Park Run Times Distribution",
        subtitle = "Records from Aug 2012 to Aug 2015",
        caption = "Source: Andrew Tomlinson")+
  scale_x_continuous(breaks = 14:33)+
  geom_hline(yintercept = 0,linewidth = 1,col = "grey30")+
  theme_minimal()+
  theme(title = element_text(size = 15))

Code

# Have the run times improved?
  # A different exploration
data |> 
  ggplot(aes(x = date, y = runtime))+
  geom_point() +  
  geom_smooth(method = "lm")

Code

import polars as pl
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

sns.set_theme(style="whitegrid")

# Assigning the path to a variable
path_to_file = "../00_data/ParkRunPerformanceData.xlsx" # Corrected path

# Reading data from an Excel file
data = pl.from_pandas(pd.read_excel(path_to_file, sheet_name="Sheet1"))

# Exploring the data
print("---\n--- Head of Data ---\n---")

---
--- Head of Data ---
---

Code

print(data.head())

shape: (5, 2)
┌─────────────────────┬────────────────────┐
│ Event Date          ┆ Run Time (minutes) │
│ ---                 ┆ ---                │
│ datetime[ns]        ┆ f64                │
╞═════════════════════╪════════════════════╡
│ 2015-08-22 00:00:00 ┆ 20.633333          │
│ 2015-08-15 00:00:00 ┆ 22.383333          │
│ 2015-08-08 00:00:00 ┆ 26.966667          │
│ 2015-08-01 00:00:00 ┆ 22.25              │
│ 2015-07-25 00:00:00 ┆ 24.816667          │
└─────────────────────┴────────────────────┘

Code

print("\n--- Schema ---")


--- Schema ---

Code

print(data.schema)

Schema({'Event Date': Datetime(time_unit='ns', time_zone=None), 'Run Time (minutes)': Float64})

Code

print("\n--- Summary Statistics ---")


--- Summary Statistics ---

Code

print(data.describe())

shape: (9, 3)
┌────────────┬─────────────────────┬────────────────────┐
│ statistic  ┆ Event Date          ┆ Run Time (minutes) │
│ ---        ┆ ---                 ┆ ---                │
│ str        ┆ str                 ┆ f64                │
╞════════════╪═════════════════════╪════════════════════╡
│ count      ┆ 120                 ┆ 120.0              │
│ null_count ┆ 0                   ┆ 0.0                │
│ mean       ┆ 2014-04-07 21:48:00 ┆ 23.420278          │
│ std        ┆ null                ┆ 2.059807           │
│ min        ┆ 2012-08-04 00:00:00 ┆ 20.633333          │
│ 25%        ┆ 2013-09-14 00:00:00 ┆ 21.8               │
│ 50%        ┆ 2014-04-12 00:00:00 ┆ 23.116667          │
│ 75%        ┆ 2014-11-22 00:00:00 ┆ 24.816667          │
│ max        ┆ 2015-08-22 00:00:00 ┆ 31.133333          │
└────────────┴─────────────────────┴────────────────────┘

Code

# Rename the columns
if len(data.columns) >= 2:
    data = data.rename({data.columns[0]: "date", data.columns[1]: "runtime"})

# Sorting the data
data_sorted = data.sort("runtime")
print("\n--- Sorted Data (Ascending) ---")


--- Sorted Data (Ascending) ---

Code

print(data_sorted.head())

shape: (5, 2)
┌─────────────────────┬───────────┐
│ date                ┆ runtime   │
│ ---                 ┆ ---       │
│ datetime[ns]        ┆ f64       │
╞═════════════════════╪═══════════╡
│ 2015-08-22 00:00:00 ┆ 20.633333 │
│ 2014-09-27 00:00:00 ┆ 20.666667 │
│ 2015-04-18 00:00:00 ┆ 20.733333 │
│ 2015-04-25 00:00:00 ┆ 20.816667 │
│ 2015-04-04 00:00:00 ┆ 20.966667 │
└─────────────────────┴───────────┘

Code

data_sorted_inv = data.sort("runtime", descending=True)
print("\n--- Sorted Data (Descending) ---")


--- Sorted Data (Descending) ---

Code

print(data_sorted_inv.head())

shape: (5, 2)
┌─────────────────────┬───────────┐
│ date                ┆ runtime   │
│ ---                 ┆ ---       │
│ datetime[ns]        ┆ f64       │
╞═════════════════════╪═══════════╡
│ 2013-10-19 00:00:00 ┆ 31.133333 │
│ 2013-07-20 00:00:00 ┆ 30.566667 │
│ 2013-03-16 00:00:00 ┆ 29.416667 │
│ 2014-02-01 00:00:00 ┆ 27.466667 │
│ 2013-01-12 00:00:00 ┆ 27.45     │
└─────────────────────┴───────────┘

Code

# Calculating summaries
summary_dates = data.select([
    pl.col("date").min().alias("min_date"),
    pl.col("date").max().alias("max_date")
])
print("\n--- Summary Dates ---")


--- Summary Dates ---

Code

print(summary_dates)

shape: (1, 2)
┌─────────────────────┬─────────────────────┐
│ min_date            ┆ max_date            │
│ ---                 ┆ ---                 │
│ datetime[ns]        ┆ datetime[ns]        │
╞═════════════════════╪═════════════════════╡
│ 2012-08-04 00:00:00 ┆ 2015-08-22 00:00:00 │
└─────────────────────┴─────────────────────┘

Code

summary_runtimes = data.select([
    pl.len().alias("count"),
    pl.col("runtime").mean().alias("mean_runtime"),
    pl.col("runtime").max().alias("slowest"),
    pl.col("runtime").min().alias("fastest")
])
print("\n--- Summary Runtimes ---")


--- Summary Runtimes ---

Code

print(summary_runtimes)

shape: (1, 4)
┌───────┬──────────────┬───────────┬───────────┐
│ count ┆ mean_runtime ┆ slowest   ┆ fastest   │
│ ---   ┆ ---          ┆ ---       ┆ ---       │
│ u32   ┆ f64          ┆ f64       ┆ f64       │
╞═══════╪══════════════╪═══════════╪═══════════╡
│ 120   ┆ 23.420278    ┆ 31.133333 ┆ 20.633333 │
└───────┴──────────────┴───────────┴───────────┘

Code

# Rounding the run times
data_rounded = data.with_columns(
    pl.col("runtime").round(0).alias("runtime_mins")
)
print("\n--- Data with Rounded Runtimes ---")


--- Data with Rounded Runtimes ---

Code

print(data_rounded.head())

shape: (5, 3)
┌─────────────────────┬───────────┬──────────────┐
│ date                ┆ runtime   ┆ runtime_mins │
│ ---                 ┆ ---       ┆ ---          │
│ datetime[ns]        ┆ f64       ┆ f64          │
╞═════════════════════╪═══════════╪══════════════╡
│ 2015-08-22 00:00:00 ┆ 20.633333 ┆ 21.0         │
│ 2015-08-15 00:00:00 ┆ 22.383333 ┆ 22.0         │
│ 2015-08-08 00:00:00 ┆ 26.966667 ┆ 27.0         │
│ 2015-08-01 00:00:00 ┆ 22.25     ┆ 22.0         │
│ 2015-07-25 00:00:00 ┆ 24.816667 ┆ 25.0         │
└─────────────────────┴───────────┴──────────────┘

Code

# Frequencies
freq = data_rounded.group_by("runtime_mins").len().sort("runtime_mins")
print("\n--- Frequencies ---")


--- Frequencies ---

Code

print(freq)

shape: (9, 2)
┌──────────────┬─────┐
│ runtime_mins ┆ len │
│ ---          ┆ --- │
│ f64          ┆ u32 │
╞══════════════╪═════╡
│ 21.0         ┆ 19  │
│ 22.0         ┆ 32  │
│ 23.0         ┆ 17  │
│ 24.0         ┆ 19  │
│ 25.0         ┆ 15  │
│ 26.0         ┆ 10  │
│ 27.0         ┆ 5   │
│ 29.0         ┆ 1   │
│ 31.0         ┆ 2   │
└──────────────┴─────┘

Code

# Nicer Histogram
plt.figure(figsize=(10, 6))
sns.histplot(
    data=data_rounded.to_pandas(), 
    x="runtime", 
    binwidth=1, 
    color="steelblue", 
    edgecolor="white", 
    alpha=0.7
)
plt.title("Park Run Times Distribution\nRecords from Aug 2012 to Aug 2015", fontsize=15)
plt.xlabel("Run time in seconds")
plt.ylabel("frequency")
plt.xlim(14, 33)

(14.0, 33.0)

Code

plt.axhline(0, color="grey", linewidth=1)
plt.figtext(0.8, 0.01, "Source: Andrew Tomlinson", wrap=True, horizontalalignment='center', fontsize=10) # Uncommented
plt.show()

Various Outputs from Excel Task (Python)

Code

# Have the run times improved?
plt.figure(figsize=(10, 6))
pdf = data.to_pandas()
# Ensure date is numeric for regression plot
pdf['date_num'] = mdates.date2num(pdf['date'])

sns.regplot(
    data=pdf, 
    x='date_num', 
    y="runtime", 
    scatter_kws={'s':10}, 
    line_kws={'color':'blue'}
)
# Fix x-axis to show dates
ax = plt.gca()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.title("Run Times over Date")
plt.show() # Uncommented

Task 2: SPSS Data Analysis

This task analyzes running data originally handled in SPSS (RunningData.xlsx), including filtering, grouping, and statistical tests.

Code

library(tidyverse)
library(readxl)

path_to_file <- "../00_data/RunningData.xlsx"
data <- read_excel(path = path_to_file, sheet = "Sheet1")

# Rename columns
names(data) <- c("position", "time", "age_cat", "gender", "prev_runs")

# Summaries
cat("---\n--- Summary of Time ---")

---
--- Summary of Time ---

Code

summary(data$time)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.90   24.40   27.88   28.31   32.00   43.23

Code

cat("\n--- Unique Age Categories ---")


--- Unique Age Categories ---

Code

unique(data$age_cat)

 [1] "30-34" "20-24" "25-29" "40-44" "15-17" "35-39" "45-49" "50-54" "55-59"
[10] "11-14" "10"    "60-64" "18-19" "65-69" "70-74" "75-79"

Code

# Subset only adults 
children_cats <- c("10","11-14","15-17")
data_adults <- data |> 
  filter(!age_cat %in% children_cats)

cat("\n--- Unique Adult Age Categories ---")


--- Unique Adult Age Categories ---

Code

unique(data_adults$age_cat)

 [1] "30-34" "20-24" "25-29" "40-44" "35-39" "45-49" "50-54" "55-59" "60-64"
[10] "18-19" "65-69" "70-74" "75-79"

Code

cat("\n--- Summary of Adult Time ---")


--- Summary of Adult Time ---

Code

summary(data_adults$time)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.90   24.34   27.68   28.16   31.72   42.68

Code

# Analysis by gender
cat("\n--- Analysis by Gender ---")


--- Analysis by Gender ---

Code

data_adults |> 
  group_by(gender) |> 
  summarise(min = min(time),
            mean = mean(time),
            median = median(time),
            max = max(time))

# A tibble: 2 × 5
  gender   min  mean median   max
  <chr>  <dbl> <dbl>  <dbl> <dbl>
1 F       21.1  31.0   30.8  42.7
2 M       16.9  25.2   24.7  40.8

Code

# Comparing distributions
data_adults |> 
  ggplot(aes(x = time, fill = gender))+
  geom_histogram(col = "white")+
  facet_grid(gender~.)

Code

data_adults |> 
  ggplot(aes(x = time, col = gender))+
  geom_density()

Code

data_adults |> 
  ggplot(aes(x = time, col = gender))+
  geom_boxplot()

Code

data_adults |> 
  ggplot(aes(x = time, y = gender, col = gender))+
  geom_violin()

Code

## Statistical tests
# Extracting the data
times_female_adults <- data_adults |> filter(gender == "F") |> pull(time)
times_male_adults <- data_adults |> filter(gender == "M") |> pull(time)

cat("\n--- T-Test Results (Male vs Female) ---")


--- T-Test Results (Male vs Female) ---

Code

t.test(times_male_adults, times_female_adults)


    Welch Two Sample t-test

data:  times_male_adults and times_female_adults
t = -14.636, df = 521.03, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.622564 -5.055101
sample estimates:
mean of x mean of y 
 25.15512  30.99395

Code

# Previous runs vs times
data_adults |> 
  ggplot(aes(x = prev_runs, y = time))+
  geom_point()+
  geom_smooth(method = "lm")

Code

data_adults |> 
  ggplot(aes(x = prev_runs, y =time, col = gender))+
  geom_point() +
  geom_smooth(method = "lm")

Code

cat("\n--- Correlation Test (Time vs Previous Runs) ---")


--- Correlation Test (Time vs Previous Runs) ---

Code

cor.test(data_adults$time, data_adults$prev_runs)


    Pearson's product-moment correlation

data:  data_adults$time and data_adults$prev_runs
t = -5.4024, df = 522, p-value = 1.001e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3096642 -0.1473580
sample estimates:
       cor 
-0.2301107

Code

# Finding the median of prev runs
median_prev_runs <- median(data_adults$prev_runs)
cat("\n--- Median Previous Runs ---")


--- Median Previous Runs ---

Code

print(median_prev_runs)

[1] 18

Code

data_adults_pr_gr <- data_adults |> 
  mutate(pr_gr = prev_runs>=median_prev_runs)

cat("\n--- Data with pr_gr group ---")


--- Data with pr_gr group ---

Code

print(head(data_adults_pr_gr))

# A tibble: 6 × 6
  position  time age_cat gender prev_runs pr_gr
     <dbl> <dbl> <chr>   <chr>      <dbl> <lgl>
1        1  16.9 30-34   M             60 TRUE 
2        2  17.2 20-24   M             91 TRUE 
3        3  17.3 25-29   M             26 TRUE 
4        4  17.4 40-44   M             14 FALSE
5        5  17.4 40-44   M              4 FALSE
6        7  18.2 35-39   M             19 TRUE

Code

# a quick visual check
data_adults_pr_gr |> 
  ggplot(aes(prev_runs,fill = pr_gr))+
  geom_histogram()

Code

# Comparing times
data_adults_pr_gr |> 
  ggplot(aes(x = time, col = pr_gr))+
  geom_boxplot()

Code

# Linear Model
# Extract first two digits of age category for numeric age proxy
data_adults_lm <- data_adults |> 
  mutate(age = str_extract(age_cat, '^\\d{2}') |> as.numeric())  

my_linear_model <- lm(formula = "time ~ age + gender + prev_runs", data = data_adults_lm) 
cat("\n--- Linear Model Summary ---")


--- Linear Model Summary ---

Code

summary(my_linear_model)


Call:
lm(formula = "time ~ age + gender + prev_runs", data = data_adults_lm)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3003  -3.3456  -0.3801   2.5654  15.2136 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.198335   0.723994  41.711  < 2e-16 ***
age          0.043466   0.018523   2.347   0.0193 *  
genderM     -5.657628   0.395852 -14.292  < 2e-16 ***
prev_runs   -0.035480   0.007473  -4.748 2.66e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.473 on 520 degrees of freedom
Multiple R-squared:  0.322, Adjusted R-squared:  0.3181 
F-statistic: 82.31 on 3 and 520 DF,  p-value: < 2.2e-16

Code

import polars as pl
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf
import numpy as np

sns.set_theme(style="whitegrid")

path_to_file = "../00_data/RunningData.xlsx"
data = pl.from_pandas(pd.read_excel(path_to_file, sheet_name="Sheet1"))

# Rename columns
if len(data.columns) >= 5:
    data = data.rename({
        data.columns[0]: "position",
        data.columns[1]: "time",
        data.columns[2]: "age_cat",
        data.columns[3]: "gender",
        data.columns[4]: "prev_runs"
    })

print("---\n--- Head ---")

---
--- Head ---

Code

print(data.head())

shape: (5, 5)
┌──────────┬───────────┬─────────┬────────┬───────────┐
│ position ┆ time      ┆ age_cat ┆ gender ┆ prev_runs │
│ ---      ┆ ---       ┆ ---     ┆ ---    ┆ ---       │
│ i64      ┆ f64       ┆ str     ┆ str    ┆ i64       │
╞══════════╪═══════════╪═════════╪════════╪═══════════╡
│ 1        ┆ 16.9      ┆ 30-34   ┆ M      ┆ 60        │
│ 2        ┆ 17.166667 ┆ 20-24   ┆ M      ┆ 91        │
│ 3        ┆ 17.3      ┆ 25-29   ┆ M      ┆ 26        │
│ 4        ┆ 17.366667 ┆ 40-44   ┆ M      ┆ 14        │
│ 5        ┆ 17.383333 ┆ 40-44   ┆ M      ┆ 4         │
└──────────┴───────────┴─────────┴────────┴───────────┘

Code

print("\n--- Schema ---")


--- Schema ---

Code

print(data.schema)

Schema({'position': Int64, 'time': Float64, 'age_cat': String, 'gender': String, 'prev_runs': Int64})

Code

print("\n--- Summary ---")


--- Summary ---

Code

print(data.describe())

shape: (9, 6)
┌────────────┬───────────┬───────────┬─────────┬────────┬───────────┐
│ statistic  ┆ position  ┆ time      ┆ age_cat ┆ gender ┆ prev_runs │
│ ---        ┆ ---       ┆ ---       ┆ ---     ┆ ---    ┆ ---       │
│ str        ┆ f64       ┆ f64       ┆ str     ┆ str    ┆ f64       │
╞════════════╪═══════════╪═══════════╪═════════╪════════╪═══════════╡
│ count      ┆ 608.0     ┆ 608.0     ┆ 608     ┆ 608    ┆ 608.0     │
│ null_count ┆ 0.0       ┆ 0.0       ┆ 0       ┆ 0      ┆ 0.0       │
│ mean       ┆ 304.5     ┆ 28.31176  ┆ null    ┆ null   ┆ 26.036184 │
│ std        ┆ 175.65876 ┆ 5.473712  ┆ null    ┆ null   ┆ 26.738812 │
│ min        ┆ 1.0       ┆ 16.9      ┆ 10      ┆ F      ┆ 1.0       │
│ 25%        ┆ 153.0     ┆ 24.4      ┆ null    ┆ null   ┆ 8.0       │
│ 50%        ┆ 305.0     ┆ 27.883333 ┆ null    ┆ null   ┆ 17.0      │
│ 75%        ┆ 456.0     ┆ 32.0      ┆ null    ┆ null   ┆ 34.0      │
│ max        ┆ 608.0     ┆ 43.233333 ┆ 75-79   ┆ M      ┆ 189.0     │
└────────────┴───────────┴───────────┴─────────┴────────┴───────────┘

Code

print("\n--- Time Summary ---")


--- Time Summary ---

Code

print(data["time"].describe())

shape: (9, 2)
┌────────────┬───────────┐
│ statistic  ┆ value     │
│ ---        ┆ ---       │
│ str        ┆ f64       │
╞════════════╪═══════════╡
│ count      ┆ 608.0     │
│ null_count ┆ 0.0       │
│ mean       ┆ 28.31176  │
│ std        ┆ 5.473712  │
│ min        ┆ 16.9      │
│ 25%        ┆ 24.4      │
│ 50%        ┆ 27.883333 │
│ 75%        ┆ 32.0      │
│ max        ┆ 43.233333 │
└────────────┴───────────┘

Code

# Subset adults
children_cats = ["10", "11-14", "15-17"]
data_adults = data.filter(~pl.col("age_cat").is_in(children_cats))

print("\n--- Adult Age Categories ---\n")


--- Adult Age Categories ---

Code

print(data_adults["age_cat"].unique())

shape: (13,)
Series: 'age_cat' [str]
[
    "70-74"
    "18-19"
    "20-24"
    "75-79"
    "30-34"
    …
    "40-44"
    "35-39"
    "65-69"
    "55-59"
    "50-54"
]

Code

print("\n--- Adult Time Summary ---")


--- Adult Time Summary ---

Code

print(data_adults["time"].describe())

shape: (9, 2)
┌────────────┬───────────┐
│ statistic  ┆ value     │
│ ---        ┆ ---       │
│ str        ┆ f64       │
╞════════════╪═══════════╡
│ count      ┆ 524.0     │
│ null_count ┆ 0.0       │
│ mean       ┆ 28.163677 │
│ std        ┆ 5.417044  │
│ min        ┆ 16.9      │
│ 25%        ┆ 24.35     │
│ 50%        ┆ 27.683333 │
│ 75%        ┆ 31.716667 │
│ max        ┆ 42.683333 │
└────────────┴───────────┘

Code

# Analysis by gender
print("\n--- Analysis by Gender ---")


--- Analysis by Gender ---

Code

summary_gender = data_adults.group_by("gender").agg([
    pl.col("time").min().alias("min"),
    pl.col("time").mean().alias("mean"),
    pl.col("time").median().alias("median"),
    pl.col("time").max().alias("max")
])
print(summary_gender)

shape: (2, 5)
┌────────┬──────┬───────────┬───────────┬───────────┐
│ gender ┆ min  ┆ mean      ┆ median    ┆ max       │
│ ---    ┆ ---  ┆ ---       ┆ ---       ┆ ---       │
│ str    ┆ f64  ┆ f64       ┆ f64       ┆ f64       │
╞════════╪══════╪═══════════╪═══════════╪═══════════╡
│ M      ┆ 16.9 ┆ 25.155118 ┆ 24.725    ┆ 40.783333 │
│ F      ┆ 21.1 ┆ 30.993951 ┆ 30.841667 ┆ 42.683333 │
└────────┴──────┴───────────┴───────────┴───────────┘

Code

# Comparing distributions
g = sns.FacetGrid(data_adults.to_pandas(), row="gender", hue="gender", aspect=2, height=3)
g.map(sns.histplot, "time", edgecolor="white")

Code

plt.show()

Code

plt.figure(figsize=(10,6)) # Added figure size for consistency
sns.kdeplot(data=data_adults.to_pandas(), x="time", hue="gender")
plt.title("Density by Gender")
plt.show()

Code

plt.figure(figsize=(10,6)) # Added figure size for consistency
sns.boxplot(data=data_adults.to_pandas(), x="time", hue="gender")
plt.title("Boxplot by Gender")
plt.show()

Code

plt.figure(figsize=(10,6)) # Added figure size for consistency
sns.violinplot(data=data_adults.to_pandas(), x="time", y="gender", hue="gender")
plt.title("Violin Plot by Gender")
plt.show()

Code

## Statistical tests
# Extracting the data
times_female_adults = data_adults.filter(pl.col("gender") == "F")["time"]
times_male_adults = data_adults.filter(pl.col("gender") == "M")["time"]

print("\n--- Female Times (Head) ---")


--- Female Times (Head) ---

Code

print(times_female_adults.head())

shape: (10,)
Series: 'time' [f64]
[
    21.1
    22.483333
    22.983333
    23.45
    23.5
    23.983333
    24.016667
    24.016667
    24.016667
    24.2
]

Code

print("\n--- Male Times (Head) ---")


--- Male Times (Head) ---

Code

print(times_male_adults.head())

shape: (10,)
Series: 'time' [f64]
[
    16.9
    17.166667
    17.3
    17.366667
    17.383333
    18.216667
    18.333333
    18.383333
    18.483333
    18.65
]

Code

t_stat, p_val = stats.ttest_ind(times_male_adults, times_female_adults)
print(f"\n--- T-Test Results ---\nt-statistic: {t_stat}\np-value: {p_val}")


--- T-Test Results ---
t-statistic: -14.627688831201171
p-value: 7.426339858226408e-41

Code

# Previous runs vs times
plt.figure(figsize=(10,6)) # Added figure size for consistency
sns.regplot(data=data_adults.to_pandas(), x="prev_runs", y="time")
plt.title("Time vs Previous Runs")
plt.show()

Code

plt.figure(figsize=(10,6)) # Added figure size for consistency
sns.lmplot(data=data_adults.to_pandas(), x="prev_runs", y="time", hue="gender")

Code

plt.title("Time vs Previous Runs by Gender")
plt.show()

Code

# Correlation
corr, p_corr = stats.pearsonr(data_adults["time"], data_adults["prev_runs"])
print(f"\n--- Correlation Test ---\ncorrelation: {corr}\np-value: {p_corr}")


--- Correlation Test ---
correlation: -0.23011066826185408
p-value: 1.0010219010793921e-07

Code

# Finding the median of prev runs
median_prev_runs = data_adults["prev_runs"].median()
print(f"\nMedian Previous Runs: {median_prev_runs}")


Median Previous Runs: 18.0

Code

data_adults_pr_gr = data_adults.with_columns(
    (pl.col("prev_runs") >= median_prev_runs).alias("pr_gr")
)

print("\n--- Data with pr_gr group ---")


--- Data with pr_gr group ---

Code

print(data_adults_pr_gr.select(["prev_runs", "pr_gr"]).head())

shape: (5, 2)
┌───────────┬───────┐
│ prev_runs ┆ pr_gr │
│ ---       ┆ ---   │
│ i64       ┆ bool  │
╞═══════════╪═══════╡
│ 60        ┆ true  │
│ 91        ┆ true  │
│ 26        ┆ true  │
│ 14        ┆ false │
│ 4         ┆ false │
└───────────┴───────┘

Code

# a quick visual check
plt.figure(figsize=(10,6))
sns.histplot(data=data_adults_pr_gr.to_pandas(), x="prev_runs", hue="pr_gr", multiple="stack")
plt.title("Previous Runs Split")
plt.show()

Code

# Comparing times
plt.figure(figsize=(10,6))
sns.boxplot(data=data_adults_pr_gr.to_pandas(), x="time", hue="pr_gr")
plt.title("Time by Previous Runs Group")
plt.show()

Code

# Linear Model
data_adults_lm = data_adults.with_columns(
    pl.col("age_cat").str.extract(r"^(\d{2})", 1).cast(pl.Float64).alias("age")
)

model = smf.ols(formula="time ~ age + gender + prev_runs", data=data_adults_lm.to_pandas())
results = model.fit()
print("\n--- Linear Model Summary ---\n")


--- Linear Model Summary ---

Code

print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   time   R-squared:                       0.322
Model:                            OLS   Adj. R-squared:                  0.318
Method:                 Least Squares   F-statistic:                     82.31
Date:                Wed, 26 Nov 2025   Prob (F-statistic):           1.39e-43
Time:                        15:24:58   Log-Likelihood:                -1526.5
No. Observations:                 524   AIC:                             3061.
Df Residuals:                     520   BIC:                             3078.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      30.1983      0.724     41.711      0.000      28.776      31.621
gender[T.M]    -5.6576      0.396    -14.292      0.000      -6.435      -4.880
age             0.0435      0.019      2.347      0.019       0.007       0.080
prev_runs      -0.0355      0.007     -4.748      0.000      -0.050      -0.021
==============================================================================
Omnibus:                       27.598   Durbin-Watson:                   0.589
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.512
Skew:                           0.578   Prob(JB):                     2.37e-07
Kurtosis:                       3.248   Cond. No.                         193.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Reuse

CC BY-SA 4.0

Task 1: Excel Data Analysis

Task 2: SPSS Data Analysis

Reuse

Copyright