Welcome to the Fundamentals!

This section covers the essentials you need to get started with the concepts and tools for data science.

Integrated Development Environments (IDEs)

An IDE (Integrated Development Environment) is your toolkit for writing code efficiently.

RStudio

Popular IDE specifically designed for R programming

  1. Source Editor (top-left): Write and edit R code
  2. Console (bottom-left): Code execution and results
  3. Environment/History (top-right): Variables and command history
  4. Files/Plots/Packages/Help (bottom-right): File browser, plots, installed packages, and help pages

Tip: Customize the layout via View → Panes → Pane Layout

VS Code

Versatile IDE supporting multiple languages including R and Python

  1. Activity Bar (left): Switch between views (Explorer, Search, Source Control, Extensions)
  2. Side Bar (left): File explorer and other views
  3. Editor (center): Write your code
  4. Console (bottom): Run commands and see output
  5. Output (right): Preview visual outputs/documents

Popular Extensions: Python, R, Pylance, Quarto

Positron

New IDE from Posit (makers of RStudio) supporting R and Python equally

Note: Currently in beta but shows great promise for bilingual data scientists

The Fundamentals of R

  • how to organize your work
  • basic data types and structures
  • using R for calculations

Organizing Your Work

Project structure and file paths

my-project/
├── data/
│   ├── raw/              # Original data files
│   └── processed/        # Cleaned data
├── code/
│   ├── analysis.R
│   └── plots.R
├── outputs/
│   ├── figures/
│   └── results/
├── README.md
└── my-project.Rproj

Tips:

  • Keep your work organised in project folders for easy maintenance.
  • Use RStudio projects (.Rproj files) to manage your R work.
  • Use meaningful folder and file names.
  • Keep data separate from code and outputs.
  • Keep raw data unchanged; process copies instead.

Working with Paths in R

Relative paths (from current working directory):

# Set working directory (opening the project's .Rproj file in RStudio does this for you)
setwd("C:/Users/Alice/my-project")

# Read data using relative path
data <- read.csv("data/raw/mydata.csv")

Absolute paths (full path from root):

# Read using absolute path
data <- read.csv("C:/Users/Alice/my-project/data/raw/mydata.csv")

Best practice: Prefer relative paths. Absolute paths break as soon as the project is moved or shared with someone else.

Data Types in R

# Numeric, Integer, Character, Logical
x <- 42          # numeric
y <- 42L         # integer
name <- "Alice"  # character
flag <- TRUE     # logical

class(x)   # "numeric"
typeof(x)  # "double" (how R stores the value internally)

Key Types:

  • numeric: Real numbers
  • integer: Whole numbers only
  • character: Text strings
  • logical: TRUE or FALSE

Vectors in R

Sequences of values of the same type:

numbers <- c(1, 2, 3, 4, 5)
names <- c("Alice", "Bob", "Charlie")
flags <- c(TRUE, FALSE, TRUE)

first_number <- numbers[1]   # 1 (R uses 1-based indexing)
first_three <- numbers[1:3]  # 1, 2, 3 (both ends are included)

Data Frames in R

Tables with rows and columns:

students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(20, 21, 19),
  grade = c("A", "B", "A")
)

students$name      # access column
students[1, ]      # access row
head(students)     # view data

Lists in R

Flexible containers for mixed types:

my_list <- list(
  name = "Alice",
  age = 20,
  scores = c(85, 90, 88),
  data = data.frame(x = 1:3, y = 4:6)
)

my_list$name   # access an element by name
my_list[[1]]   # access an element by position

Using R

R as a Calculator

Simple operations:

2 + 3           # Addition
10 - 4          # Subtraction
5 * 6           # Multiplication
20 / 4          # Division
2 ^ 3           # Exponentiation
10 %% 3         # Modulo

See: Arithmetic Operators

Main Operators in R

Two main operators you’ll use often:

  • Assignment: <- or = to assign values to variables (<- is the conventional choice in R)
# Assignment
x <- 10
y = 20
total <- x + y  # named `total` so it does not shadow the built-in sum()
  • Pipe: |> to chain commands (introduced in R 4.1.0); read it as “then”
result <- c(1, 2, 3, 4, 5) |> # create a vector, `then`
  sum() |> # calculate the sum, `then`
  sqrt() # calculate the square root
print(result)

Subsetting data

Extract specific elements from data structures:

vec <- c(10, 20, 30, 40, 50)
first_element <- vec[1]        # 10
subset <- vec[2:4]             # 20, 30, 40
df <- data.frame(
  a = 1:5,
  b = 6:10,
  c = letters[1:5]
)
df$a  # access column `a`
df[1, ]   # access first row

More on subsetting: Advanced R

Control Flow

Making decisions and repeating tasks

If Statements in R

Execute code conditionally:

age <- 25

if (age >= 18) {
  print("Adult")
} else {
  print("Minor")
}

# Multiple conditions
if (age < 13) {
  category <- "Child"
} else if (age < 18) {
  category <- "Teen"
} else {
  category <- "Adult"
}

For Loops in R

Repeat code multiple times:

# Simple loop
for (i in 1:5) {
  print(i)
}

# Loop over vector
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
  print(paste("I like", fruit))
}

# Store results
results <- numeric(5)
for (i in 1:5) {
  results[i] <- i ^ 2
}
results  # 1, 4, 9, 16, 25

Using Packages in R

Packages are collections of functions that extend R's capabilities: support for more data types and sources, additional methods, more efficient coding, and so on.

Source: Storybench

Installing and Loading Packages

# Install once
install.packages("tidyverse")

# Load each session
# This usually goes at the top of your script/document
library(tidyverse)

Finding Documentation

?mean               # Help on mean function
help("lm")          # Help on lm function
example("plot")     # Examples for plot function

Key Takeaways

✅ Know your IDE (RStudio or VS Code)

✅ Understand basic data types in R

✅ Know the difference between data structures

✅ Organize your work with folder structure

✅ Use relative paths for portability

✅ Control program flow with if statements and loops

✅ Learn how to install and use packages

✅ Know where to find documentation

Python Content (Optional)

If you’re interested in Python, here are the equivalent concepts.

The Fundamentals of Python

Organizing Your Work

Project structure and file paths

my-project/
├── data/
│   ├── raw/              # Original data files
│   └── processed/        # Cleaned data
├── code/
│   ├── analysis.py
│   └── plots.py
├── outputs/
│   ├── figures/
│   └── results/
├── README.md
└── requirements.txt

Tips:

  • Keep your work organised for easy maintenance
  • Use meaningful folder and file names
  • Keep data separate from code and outputs
  • Keep raw data unchanged; process copies instead

Working with Paths in Python

Relative paths (from current working directory):

import os
import pandas as pd

# Change working directory
os.chdir("C:/Users/Alice/my-project")

# Read data using relative path
data = pd.read_csv("data/raw/mydata.csv")

Absolute paths (full path from root):

data = pd.read_csv("C:/Users/Alice/my-project/data/raw/mydata.csv")

Best practice: Prefer relative paths. Absolute paths break as soon as the project is moved or shared with someone else.
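
One way to keep relative paths portable across operating systems is the standard-library pathlib module. The sketch below assumes the folder layout shown above and reuses the illustrative file name mydata.csv; it only builds and inspects the path and reads nothing from disk.

```python
from pathlib import Path

# Build a relative path piece by piece; pathlib handles the separator
# for the operating system the code runs on.
raw_dir = Path("data") / "raw"
data_file = raw_dir / "mydata.csv"   # illustrative file name

print(data_file.name)           # mydata.csv
print(data_file.suffix)         # .csv
print(data_file.is_absolute())  # False

# as_posix() always uses forward slashes, whatever the operating system
print(data_file.as_posix())     # data/raw/mydata.csv
```

pandas readers such as pd.read_csv accept a Path object in place of a string, so data_file could be passed to them directly.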

Data Types in Python

x = 42          # int
y = 3.14        # float
name = "Alice"  # str
flag = True     # bool

type(x)     # <class 'int'>
type(name)  # <class 'str'>

Key Types:

  • int: Integers
  • float: Decimal numbers
  • str: Text strings
  • bool: True or False
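
A short sketch of how these types can be checked and converted. The variable names are illustrative only.

```python
x = 42
y = 3.14

# isinstance() checks whether a value has a given type
print(isinstance(x, int))      # True
print(isinstance(y, float))    # True

# Convert between types explicitly
as_float = float(x)            # 42.0
as_text = str(x)               # "42"
as_int = int("7")              # 7
truncated = int(y)             # 3 (int() drops the decimal part)

# Mixing int and float in arithmetic gives a float
mixed_sum = x + y
print(type(mixed_sum))         # <class 'float'>
```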

Lists in Python

Ordered collections (can be mixed types):

numbers = [1, 2, 3, 4, 5]
names = ["Alice", "Bob", "Charlie"]
mixed = [1, "Alice", 3.14, True]

first = numbers[0]          # 0-based indexing!
first_three = numbers[0:3]  # [1, 2, 3] (the stop index is excluded)
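
A few more indexing and slicing patterns, sketched on the same numbers list. The names last, middle, and every_other are illustrative.

```python
numbers = [1, 2, 3, 4, 5]

last = numbers[-1]           # negative indices count from the end: 5
middle = numbers[1:4]        # start is included, stop is excluded: [2, 3, 4]
every_other = numbers[::2]   # step of 2: [1, 3, 5]

# Lists are mutable: they can be changed in place
numbers.append(6)            # add to the end
numbers[0] = 10              # replace the first element
print(numbers)               # [10, 2, 3, 4, 5, 6]
print(len(numbers))          # 6
```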

Dictionaries in Python

Key-value pairs:

student = {
    "name": "Alice",
    "age": 20,
    "grade": "A"
}

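student["name"]     # access by key (raises KeyError if the key is missing)
student.get("age")  # access by key (returns None if the key is missing)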
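
Dictionaries can also be updated and looped over. The sketch below reuses the student dictionary; the "city" and "email" keys are hypothetical additions for illustration.

```python
student = {"name": "Alice", "age": 20, "grade": "A"}

# Add a new key or update an existing one with the same syntax
student["city"] = "Paris"   # hypothetical extra field
student["age"] = 21

# .get() can return a default instead of raising KeyError for a missing key
email = student.get("email", "unknown")

# Loop over key-value pairs
for key, value in student.items():
    print(f"{key}: {value}")

# Check whether a key exists
print("grade" in student)   # True
```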

NumPy Arrays

Similar to Python lists but more efficient for numerical operations:

import numpy as np

numbers = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

first = numbers[0]

Pandas DataFrames

Tables with rows and columns:

import pandas as pd

students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [20, 21, 19],
    "grade": ["A", "B", "A"]
})

students["name"]      # access column
students.iloc[0, :]   # access row
students.head()       # view data

Using Python

Python as a Calculator

Simple operations:

2 + 3           # Addition
10 - 4          # Subtraction
5 * 6           # Multiplication
20 / 4          # Division
2 ** 3          # Exponentiation
10 % 3          # Modulo
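
Two details of Python arithmetic are worth seeing in action: / always returns a float, and // performs floor division. A brief sketch, with math.sqrt from the standard library added for comparison:

```python
import math

print(20 / 4)     # 5.0 (true division always returns a float)
print(7 // 2)     # 3   (floor division)
print(-7 // 2)    # -4  (floors toward negative infinity)
print(7 % 2)      # 1   (remainder)

# divmod() returns the quotient and the remainder together
quotient, remainder = divmod(17, 5)   # 3 and 2

root = math.sqrt(16)          # 4.0
rounded = round(3.14159, 2)   # 3.14
```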

See: Python Operators

Main Operators in Python

Key operators you’ll use regularly:

  • Assignment: = to assign values to variables
# Assignment
x = 10
y = 20
total = x + y
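  • Method chaining: . calls methods one after another; Python has no built-in pipe operator, so plain functions such as sum() are applied step by step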
result = [1, 2, 3, 4, 5] # create a list
sum_result = sum(result)  # calculate sum
sqrt_result = sum_result ** 0.5  # calculate square root
print(sqrt_result)
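
Method chaining itself applies to objects whose methods return a new object, such as strings. A minimal sketch using built-in str methods; the text value is made up for illustration.

```python
text = "  Data Science Fundamentals  "

# Each method returns a new string, so the next call can follow directly
slug = text.strip().lower().replace(" ", "-")
print(slug)   # data-science-fundamentals

# The same steps written with intermediate variables
stripped = text.strip()
lowered = stripped.lower()
slug_stepwise = lowered.replace(" ", "-")
```

Both versions give the same result. Chaining is shorter, while intermediate variables are easier to inspect when something goes wrong.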

Subsetting Data

Extract specific elements from data structures:

lst = [10, 20, 30, 40, 50]
first_element = lst[0]        # 10
subset = lst[1:4]             # [20, 30, 40]
df = pd.DataFrame({
    'a': range(1, 6),
    'b': range(6, 11),
    'c': list('abcde')})
df['a']  # access column `a`
df.iloc[0, :]   # access first row

More on subsetting here: NumPy Indexing

Control Flow

Making decisions and repeating tasks

If Statements in Python

Execute code conditionally:

age = 25

if age >= 18:
    print("Adult")
else:
    print("Minor")

# Multiple conditions
if age < 13:
    category = "Child"
elif age < 18:
    category = "Teen"
else:
    category = "Adult"
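
The same decision logic can be wrapped in a function so it can be reused and checked at the boundary ages. This is a sketch; the function name age_category is an assumption made for illustration.

```python
def age_category(age):
    """Return "Child", "Teen", or "Adult" for an age in years."""
    if age < 13:
        return "Child"
    elif age < 18:
        return "Teen"
    else:
        return "Adult"

# Check the values around each boundary
for test_age in [12, 13, 17, 18]:
    print(test_age, age_category(test_age))

# A two-way choice can also be written on one line
age = 25
label = "Adult" if age >= 18 else "Minor"
print(label)   # Adult
```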

For Loops in Python

Repeat code multiple times:

# Simple loop
for i in range(1, 6):
    print(i)

# Loop over list
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(f"I like {fruit}")

# Store results
results = []
for i in range(1, 6):
    results.append(i ** 2)
print(results)  # [1, 4, 9, 16, 25]
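
The "store results" loop above is often written as a list comprehension, and enumerate() is a common way to get a counter while looping. A brief sketch of both; the variable names are illustrative.

```python
# List comprehension: build the list of squares in one expression
squares = [i ** 2 for i in range(1, 6)]
print(squares)   # [1, 4, 9, 16, 25]

# A condition can filter the values: squares of even numbers only
even_squares = [i ** 2 for i in range(1, 6) if i % 2 == 0]
print(even_squares)   # [4, 16]

# enumerate() yields a counter together with each element
fruits = ["apple", "banana", "cherry"]
numbered = []
for position, fruit in enumerate(fruits, start=1):
    numbered.append(f"{position}. {fruit}")
print(numbered)   # ['1. apple', '2. banana', '3. cherry']
```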

Using Packages in Python

Packages are collections of functions that extend Python's capabilities: support for more data types and sources, additional methods, more efficient coding, and so on.

Installing and Importing Packages

# Install via pip (run in terminal)
# pip install pandas

# Import in your script
# This usually goes at the top of your script/notebook
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
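
The import forms above work the same way for standard-library modules, which need no installation. The sketch below shows the three forms and one possible way to check whether a package is available before importing it; the helper is_installed is a hypothetical name, built on importlib.util.find_spec from the standard library.

```python
import importlib.util

import math                    # import the whole module
import statistics as st        # import under a shorter alias
from statistics import mean    # import a single function

def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("json"))                           # True (standard library)
print(is_installed("a_package_that_does_not_exist"))  # False (made-up name)

print(math.sqrt(9))             # 3.0
print(mean([1, 2, 3]))          # 2
print(st.median([1, 2, 3, 4]))  # 2.5
```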

Finding Documentation

help(sum)              # Help on sum function
help(pd.read_csv)      # Help on pandas read_csv
?np.array              # IPython/Jupyter only; not valid in a plain Python script
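
help() shows a function's docstring, so documenting your own functions makes them discoverable in the same way. A minimal sketch; rescale is a hypothetical function written only for this illustration.

```python
import inspect

def rescale(values, factor=2):
    """Multiply every element of values by factor and return a new list."""
    return [v * factor for v in values]

# The docstring is what help(rescale) displays
print(rescale.__doc__)

# inspect.signature() shows the parameters and their defaults
print(inspect.signature(rescale))   # (values, factor=2)

print(rescale([1, 2, 3]))           # [2, 4, 6]
```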