Welcome to the Fundamentals!

This section covers the essentials you need to get started with the concepts and tools for data science.

Why Data Science?

The demand for data science skills is growing rapidly:

High Demand: Data scientist positions are among the fastest-growing jobs globally.
Market Growth: The global data science market is projected to reach $178.5 billion by 2025.

Source: US Bureau of Labor Statistics, IBM, World Economic Forum

Future Outlook

The trend is expected to continue:

35% growth in data scientist employment projected from 2022 to 2032 (US Bureau of Labor Statistics).
40% increase in demand for AI and Machine Learning specialists anticipated by 2027.

Source: US Bureau of Labor Statistics, World Economic Forum

Integrated Development Environments (IDEs)

An IDE (Integrated Development Environment) is your toolkit for writing code efficiently.

RStudio

Popular IDE specifically designed for R programming

Source Editor (top-left): Write and edit R code
Console (bottom-left): Code execution and results
Environment/History (top-right): Variables and command history
Files/Plots/Packages/Help (bottom-right): File browser and help

Tip: Customize the layout via View → Panes → Pane Layout

VS Code

Versatile IDE supporting multiple languages including R and Python

Activity Bar (far left): Switch between views (Explorer, Search, Source Control, Extensions)
Side Bar (left): File explorer and other views
Editor (center): Write your code
Console/Terminal (bottom): Run commands and see output
Output (right): Preview visual outputs/documents

Popular Extensions: Python, R, Pylance, Quarto

Positron

New IDE from Posit (makers of RStudio) supporting R and Python equally

Note: Currently in beta but shows great promise for bilingual data scientists

The Fundamentals of R

How to organize your work
Basic data types and structures
Using R for calculations

Organizing Your Work

Project structure and file paths

Recommended Folder Structure

my-project/
├── data/
│   ├── raw/              # Original data files
│   └── processed/        # Cleaned data
├── code/
│   ├── analysis.R
│   └── plots.R
├── outputs/
│   ├── figures/
│   └── results/
├── README.md
└── my-project.Rproj

Tips:

Keep your work organised in project folders for easy maintenance.
Use RStudio projects (.Rproj files) to manage your R work.
Use meaningful folder and file names.
Keep data separate from code and outputs.
Keep raw data unchanged; process copies instead.

Working with Paths in R

Relative paths (from current working directory):

# Set working directory (opening an .Rproj file in RStudio does this for you)
setwd("C:/Users/Alice/my-project")

# Read data using relative path
data <- read.csv("data/raw/mydata.csv")

Absolute paths (full path from root):

# Read using absolute path
data <- read.csv("C:/Users/Alice/my-project/data/raw/mydata.csv")

Best practice: Prefer relative paths. Absolute paths break as soon as the project is moved or shared with someone else.

Data Types in R

# Numeric, Integer, Character, Logical
x <- 42          # numeric
y <- 42L         # integer
name <- "Alice"  # character
flag <- TRUE     # logical

class(x)    # "numeric"
typeof(x)   # "double"

Key Types:

numeric: Real numbers
integer: Whole numbers only
character: Text strings
logical: TRUE or FALSE

Vectors in R

Sequences of values of the same type:

numbers <- c(1, 2, 3, 4, 5)
names <- c("Alice", "Bob", "Charlie")
flags <- c(TRUE, FALSE, TRUE)

first_number <- numbers[1]
first_three <- numbers[1:3]

Data Frames in R

Tables with rows and columns:

students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(20, 21, 19),
  grade = c("A", "B", "A")
)

students$name      # access column
students[1, ]      # access row
head(students)     # view data

Lists in R

Flexible containers for mixed types:

my_list <- list(
  name = "Alice",
  age = 20,
  scores = c(85, 90, 88),
  data = data.frame(x = 1:3, y = 4:6)
)

my_list$name   # access element by name
my_list[[1]]   # access element by position

Using R

R as a Calculator

Simple operations:

2 + 3           # Addition
10 - 4          # Subtraction
5 * 6           # Multiplication
20 / 4          # Division
2 ^ 3           # Exponentiation
10 %% 3         # Modulo

See: Arithmetic Operators

Main Operators in R

Two main operators you’ll use often:

Assignment: <- or = to assign values to variables

# Assignment
x <- 10
y = 20
total <- x + y

Pipe: |> to chain commands (introduced in R 4.1.0); read it as “then”

result <- c(1, 2, 3, 4, 5) |>  # create a vector, `then`
  sum() |>                     # calculate sum, `then`
  sqrt()                       # calculate square root
print(result)

Subsetting Data

Extract specific elements from data structures:

vec <- c(10, 20, 30, 40, 50)
first_element <- vec[1]        # 10
subset <- vec[2:4]             # 20, 30, 40
df <- data.frame(
  a = 1:5,
  b = 6:10,
  c = letters[1:5]
)
df$a  # access column `a`
df[1, ]   # access first row

Control Flow

Making decisions and repeating tasks

If Statements in R

Execute code conditionally:

age <- 25

if (age >= 18) {
  print("Adult")
} else {
  print("Minor")
}

# Multiple conditions
if (age < 13) {
  category <- "Child"
} else if (age < 18) {
  category <- "Teen"
} else {
  category <- "Adult"
}

For Loops in R

Repeat code multiple times:

# Simple loop
for (i in 1:5) {
  print(i)
}

# Loop over vector
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
  print(paste("I like", fruit))
}

# Store results
results <- numeric(5)
for (i in 1:5) {
  results[i] <- i ^ 2
}
results  # 1, 4, 9, 16, 25

Using Packages in R

Packages are collections of functions that extend R's capabilities: support for different data types and sources, additional methods, more efficient coding, and more.

Source: Storybench

Installing and Loading Packages

# Install once
install.packages("tidyverse")

# Load each session
# This usually goes at the top of your script/document
library(tidyverse)

Finding Documentation

?mean               # Help on mean function
help("lm")          # Help on lm function
example("plot")     # Examples for plot function

Key Takeaways

✅ Know your IDE (RStudio or VS Code)

✅ Understand basic data types in R

✅ Know the difference between data structures

✅ Organize your work with folder structure

✅ Use relative paths for portability

✅ Control program flow with if statements and loops

✅ Learn how to install and use packages

✅ Know where to find documentation

Python Content (Optional)

If you’re interested in Python, here are the equivalent concepts.

The Fundamentals of Python

Organising Your Work

Project structure and file paths

Recommended Folder Structure

my-project/
├── data/
│   ├── raw/              # Original data files
│   └── processed/        # Cleaned data
├── code/
│   ├── analysis.py
│   └── plots.py
├── outputs/
│   ├── figures/
│   └── results/
├── README.md
└── requirements.txt

Tips:

Keep your work organised for easy maintenance
Use meaningful folder and file names
Keep data separate from code and outputs
Keep raw data unchanged; process copies instead

Working with Paths in Python

Relative paths (from current working directory):

import os
import pandas as pd

# Change working directory
os.chdir("C:/Users/Alice/my-project")

# Read data using relative path
data = pd.read_csv("data/raw/mydata.csv")

Absolute paths (full path from root):

data = pd.read_csv("C:/Users/Alice/my-project/data/raw/mydata.csv")

Best practice: Prefer relative paths. Absolute paths break as soon as the project is moved or shared with someone else.
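
As a complement to the string paths above, the sketch below builds a relative path with the standard-library pathlib module. The folder and file names are the hypothetical ones from the recommended folder structure; nothing is read from disk.

```python
from pathlib import Path

# Hypothetical project folder, mirroring the recommended structure
project_root = Path("my-project")

# The "/" operator joins path components in an OS-independent way
data_file = project_root / "data" / "raw" / "mydata.csv"

print(data_file.as_posix())     # my-project/data/raw/mydata.csv
print(data_file.is_absolute())  # False
print(data_file.name)           # mydata.csv
print(data_file.suffix)         # .csv
```

A Path object like data_file can be passed directly to pd.read_csv in place of a string.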

Data Types in Python

x = 42          # int
y = 3.14        # float
name = "Alice"  # str
flag = True     # bool

type(x)       # <class 'int'>
type(name)    # <class 'str'>

Key Types:

int: Integers
float: Decimal numbers
str: Text strings
bool: True or False
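
A short sketch of how these types can be checked and converted. It reuses the example values above; the conversions shown are only a small selection.

```python
x = 42
y = 3.14
flag = True

is_int = isinstance(x, int)   # check a value's type
as_float = float(x)           # int -> float
as_int = int(y)               # float -> int (truncates, does not round)
as_text = str(x) + " items"   # int -> str, then concatenate
parsed = int("7") + 1         # str -> int
total = x + flag              # True behaves like 1 in arithmetic

print(is_int)    # True
print(as_float)  # 42.0
print(as_int)    # 3
print(as_text)   # 42 items
print(parsed)    # 8
print(total)     # 43
```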

Lists in Python

Ordered collections (can be mixed types):

numbers = [1, 2, 3, 4, 5]
names = ["Alice", "Bob", "Charlie"]
mixed = [1, "Alice", 3.14, True]

first = numbers[0]      # 0-based indexing!
first_three = numbers[0:3]
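
A few more list operations, sketched on the same numbers list: negative indices count from the end, slices can take a step, and lists can be changed in place.

```python
numbers = [1, 2, 3, 4, 5]

last = numbers[-1]           # 5 (negative indices count from the end)
last_two = numbers[-2:]      # [4, 5]
every_other = numbers[::2]   # [1, 3, 5] (step of 2)

numbers.append(6)            # lists are mutable: add an element in place
length = len(numbers)        # 6

print(last, last_two, every_other, length)
```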

Dictionaries in Python

Key-value pairs:

student = {
    "name": "Alice",
    "age": 20,
    "grade": "A"
}

student["name"]      # "Alice"
student.get("age")   # 20
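
Dictionaries can also be updated and looped over. The sketch below extends the student example; the "year" key and the "unknown" default are hypothetical values added for illustration.

```python
student = {"name": "Alice", "age": 20, "grade": "A"}

student["year"] = 2   # add a new key (hypothetical field)
student["age"] = 21   # overwrite an existing value

# .get() returns a default instead of raising KeyError for a missing key
email = student.get("email", "unknown")

# Loop over key-value pairs
lines = []
for key, value in student.items():
    lines.append(f"{key}: {value}")

print(email)   # unknown
print(lines)   # ['name: Alice', 'age: 21', 'grade: A', 'year: 2']
```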

NumPy Arrays

Similar to Python lists but more efficient for numerical operations:

import numpy as np

numbers = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

first = numbers[0]

Pandas DataFrames

Tables with rows and columns:

import pandas as pd

students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [20, 21, 19],
    "grade": ["A", "B", "A"]
})

students["name"]      # access column
students.iloc[0, :]   # access row
students.head()       # view data

Using Python

Python as a Calculator

Simple operations:

2 + 3           # Addition
10 - 4          # Subtraction
5 * 6           # Multiplication
20 / 4          # Division
2 ** 3          # Exponentiation
10 % 3          # Modulo

See: Python Operators

Main Operators in Python

Key operators you’ll use regularly:

Assignment: = to assign values to variables

# Assignment
x = 10
y = 20
total = x + y

Chaining: Python has no built-in pipe operator like R's |>; store intermediate results in variables, or chain method calls with .

result = [1, 2, 3, 4, 5] # create a list
sum_result = sum(result)  # calculate sum
sqrt_result = sum_result ** 0.5  # calculate square root
print(sqrt_result)
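
To illustrate chaining with . itself, here is a minimal sketch using built-in string methods. Each method returns a new string, so the next call can follow directly. The input value is a made-up example.

```python
raw_name = "  Alice SMITH  "   # hypothetical messy input

# Each method returns a new string, so calls can be chained with "."
clean_name = raw_name.strip().lower().replace(" ", "_")
print(clean_name)   # alice_smith

# Longer chains read more easily one step per line, inside parentheses
clean_name_multiline = (
    raw_name
    .strip()             # remove surrounding whitespace, then
    .lower()             # convert to lower case, then
    .replace(" ", "_")   # replace spaces with underscores
)
print(clean_name_multiline == clean_name)   # True
```

The same pattern applies to pandas objects, where most methods return a new DataFrame or Series.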

Subsetting Data

Extract specific elements from data structures:

lst = [10, 20, 30, 40, 50]
first_element = lst[0]        # 10
subset = lst[1:4]             # [20, 30, 40]
df = pd.DataFrame({
    "a": range(1, 6),
    "b": range(6, 11),
    "c": list("abcde")
})
df['a']  # access column `a`
df.iloc[0, :]   # access first row
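
Subsetting by condition, rather than by position, is also common. A minimal sketch on the same list, using a list comprehension; the threshold 25 is an arbitrary choice for illustration.

```python
lst = [10, 20, 30, 40, 50]

# Keep only the values that satisfy a condition
above_25 = [value for value in lst if value > 25]

# Keep the positions (0-based) of the matching values
positions = [index for index, value in enumerate(lst) if value > 25]

print(above_25)    # [30, 40, 50]
print(positions)   # [2, 3, 4]
```

With the pandas DataFrame above, a comparable row filter would be df[df["a"] > 2].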

Control Flow

Making decisions and repeating tasks

If Statements in Python

Execute code conditionally:

age = 25

if age >= 18:
    print("Adult")
else:
    print("Minor")

# Multiple conditions
if age < 13:
    category = "Child"
elif age < 18:
    category = "Teen"
else:
    category = "Adult"
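
Conditions can be combined with and, or, and not. The sketch below assumes a hypothetical has_ticket flag alongside the age variable, and also shows the one-line conditional expression.

```python
age = 25
has_ticket = True   # hypothetical flag, for illustration only

# Combine conditions with "and" / "not"
if age >= 18 and has_ticket:
    status = "Admitted"
elif age >= 18 and not has_ticket:
    status = "Needs ticket"
else:
    status = "Too young"

# Conditional expression: a compact if/else that produces a value
label = "Adult" if age >= 18 else "Minor"

print(status)   # Admitted
print(label)    # Adult
```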

For Loops in Python

Repeat code multiple times:

# Simple loop
for i in range(1, 6):
    print(i)

# Loop over list
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(f"I like {fruit}")

# Store results
results = []
for i in range(1, 6):
    results.append(i ** 2)
print(results)  # [1, 4, 9, 16, 25]
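
The "store results" loop above is often written as a list comprehension, and enumerate() supplies a counter while looping. A minimal sketch of both, reusing the same examples:

```python
# List comprehension: build the list of squares in one expression
squares = [i ** 2 for i in range(1, 6)]
print(squares)   # [1, 4, 9, 16, 25]

# enumerate() yields (counter, item) pairs; start=1 begins counting at 1
fruits = ["apple", "banana", "cherry"]
numbered = []
for position, fruit in enumerate(fruits, start=1):
    numbered.append(f"{position}. {fruit}")
print(numbered)  # ['1. apple', '2. banana', '3. cherry']
```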

Using Packages in Python

Packages are collections of functions that extend Python's capabilities: support for different data types and sources, additional methods, more efficient coding, and more.

Installing and Importing Packages

# Install via pip (run in terminal)
# pip install pandas

# Import in your script
# This usually goes at the top of your script/notebook
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
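
Before importing, it can help to check whether a package is installed in the current environment. The sketch below uses the standard-library importlib for that; is_installed is a hypothetical helper name, not part of any package.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the packages used in this section are available
for package in ["pandas", "numpy", "matplotlib"]:
    if is_installed(package):
        print(f"{package}: installed")
    else:
        print(f"{package}: missing, run 'pip install {package}' in a terminal")
```

The output depends on the environment, so none is shown here.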

Finding Documentation

help(sum)              # Help on sum function
help(pd.read_csv)      # Help on pandas read_csv
# In Jupyter/IPython only (not valid in a plain Python script)
?np.array
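
Documentation can also be inspected programmatically. A small sketch, assuming nothing beyond the standard library: __doc__ holds an object's docstring, and dir() lists the attributes and methods an object offers.

```python
# The docstring attached to the built-in sum function
doc_text = sum.__doc__
print(doc_text)

# dir() lists available attributes; filter out the underscore-prefixed ones
string_methods = [name for name in dir(str) if not name.startswith("_")]
print(string_methods[:5])
```

This is handy for discovering what an unfamiliar object can do before reading the full help page.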