3 An R Refresher

Essential Programming Skills for Longitudinal Data Analysis

Published

March 18, 2026

Before we fit any models, make sure you are comfortable with R itself. This chapter introduces the R programming concepts you will need throughout the workshop. If you already know R well, treat it as a quick reference. If you are new to R, work through it carefully, because the material here underpins everything that follows.

Running code

All code blocks in this workshop can be run interactively. Open the .qmd source file in RStudio, place your cursor inside a code chunk, and press Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) to run the current line or selection. To run the entire chunk, press Ctrl+Shift+Enter (Windows/Linux) or Cmd+Shift+Enter (Mac).

Basic Operations and Assignment

R can be used as a calculator, but its real power lies in storing and manipulating values. The assignment operator <- binds a value to a name in the current environment.

1 + 3           # arithmetic evaluation

[1] 4

2^8             # exponentiation

[1] 256

17 %% 5         # modulo (remainder)

[1] 2

17 %/% 5        # integer division

[1] 3

a <- 9          # assignment: bind 9 to the name 'a'

sqrt(a)         # apply a function to an object

[1] 3

b <- sqrt(a)

a == b          # logical comparison: are they equal?

[1] FALSE

a != b          # not equal

[1] TRUE

a > 5           # greater than

[1] TRUE

ls()            # list all objects in the current environment

[1] "a"          "b"          "panel_data"

Note

R is case-sensitive: A and a are different objects. Object names can contain letters, digits, . and _, but must start with a letter or a dot (and a leading dot cannot be followed by a digit).

Data Types

Every value in R has a type. The most common scalar types are:

class(3.14)         # numeric (double)

[1] "numeric"

class(42L)          # integer (the L suffix tells R to store the value as an integer)

[1] "integer"

class("hello")      # character

[1] "character"

class(TRUE)         # logical

[1] "logical"

class(2 + 3i)       # complex

[1] "complex"

# Type coercion
as.numeric("3.14")

[1] 3.14

as.integer(42)

[1] 42

as.character(100)

[1] "100"

as.logical(0)       # 0 is FALSE; anything non-zero is TRUE

[1] FALSE

is.na(NA)           # test for missing value

[1] TRUE
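Coercion also happens implicitly, and explicit coercion can fail quietly. The sketch below illustrates both cases; the object names `mixed` and `bad` are illustrative only and do not appear elsewhere in the workshop.

```r
# Implicit coercion: c() promotes everything to the most flexible type present
mixed <- c(1, "a", TRUE)
class(mixed)                      # "character"

# Logical values count as 0/1 in arithmetic
sum(c(TRUE, FALSE, TRUE))         # 2

# Failed coercion returns NA (with a warning); it does not stop the script
bad <- suppressWarnings(as.numeric(c("3.14", "n/a")))
bad                               # 3.14   NA
is.na(bad)                        # FALSE  TRUE

# class() vs typeof(): typeof() reports the internal storage type
typeof(3.14)                      # "double"
typeof(42L)                       # "integer"
```

Checking `sum(is.na(...))` right after a conversion is a cheap way to spot values that did not parse.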

Vectors

A vector is the fundamental data structure in R: an ordered collection of values of the same type. Most arithmetic, comparison, and mathematical functions in R are vectorised, meaning they operate element-wise without an explicit loop.

x <- c(1, 3, 5)          # combine values into a vector
y <- c("one", "three", "five")

# Indexing (1-based)
x[1]                      # first element

[1] 1

x[c(1, 3)]                # first and third elements

[1] 1 5

x[-2]                     # all except the second

[1] 1 5

# Sequences
a <- 1:10
b <- seq(from = 0, to = 1, by = 0.25)
c_rep <- rep(c(1, 2), times = 3)
c_rep

[1] 1 2 1 2 1 2

# Vectorised operations — applied element-wise
x * 2

[1]  2  6 10

x + c(10, 20, 30)

[1] 11 23 35

# Logical operations on vectors
x > 2

[1] FALSE  TRUE  TRUE

any(x > 2)                # is any element > 2?

[1] TRUE

all(x > 2)                # are all elements > 2?

[1] FALSE

which(x > 2)              # indices where condition is TRUE

[1] 2 3

# Common summary functions
length(x)

[1] 3

sum(x)

[1] 9

mean(x)

[1] 3

sd(x)

[1] 2

min(x); max(x)

[1] 1

[1] 5
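Two vector behaviours come up constantly with real data: missing values propagate through summary functions, and shorter vectors are recycled in element-wise operations. A minimal sketch, with made-up values and the illustrative names `income` and `wave_years`:

```r
# A short series with one missing observation (values are made up)
income <- c(40000, NA, 42000, 45000)

mean(income)                  # NA: missing values propagate by default
mean(income, na.rm = TRUE)    # 42333.33: computed on the 3 observed values
sum(is.na(income))            # 1: number of missing values

# Recycling: the shorter vector is repeated to match the longer one
c(1, 2, 3, 4) * c(10, 100)    # 10 200 30 400

# Named vectors allow access by label as well as by position
wave_years <- c(w1 = 2019, w2 = 2020, w3 = 2021)
wave_years["w2"]
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking `length()` when two vectors are supposed to line up.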

Matrices

A matrix is a two-dimensional vector: all elements share the same type, and values are arranged in rows and columns. In longitudinal data analysis, matrices appear in covariance and correlation structures, variance-component decompositions, and balanced panel layouts — understanding matrix indexing is useful for inspecting model objects and constructing derived quantities.

m <- matrix(1:25, nrow = 5, ncol = 5)
m

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

# Element access
m[1, 2]                   # row 1, column 2

[1] 6

m[1, ]                    # entire first row

[1]  1  6 11 16 21

m[, 2]                    # entire second column

[1]  6  7  8  9 10

m[2:3, 3:5]               # submatrix

     [,1] [,2] [,3]
[1,]   12   17   22
[2,]   13   18   23

# Dimensions
nrow(m); ncol(m)

[1] 5

[1] 5

dim(m)

[1] 5 5

# Matrix operations — useful for covariance and correlation matrices in panel diagnostics
t(m)                      # transpose

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25

m %*% t(m)                # matrix multiplication (not element-wise!)

     [,1] [,2] [,3] [,4] [,5]
[1,]  855  910  965 1020 1075
[2,]  910  970 1030 1090 1150
[3,]  965 1030 1095 1160 1225
[4,] 1020 1090 1160 1230 1300
[5,] 1075 1150 1225 1300 1375

diag(m)                   # diagonal elements

[1]  1  7 13 19 25

# Compute a correlation matrix — useful for checking multicollinearity among panel predictors
vars <- data.frame(
  gdp_growth = c(1.2, 2.3, 1.8, 3.1, 0.9, 2.7),
  trade_open  = c(0.4, 0.9, 0.6, 1.1, 0.3, 0.8),
  population  = c(10,  12,  11,  14,  9,   13)
)
cor(vars)                 # pairwise correlations between panel-level variables

           gdp_growth trade_open population
gdp_growth  1.0000000  0.9673915  0.9968896
trade_open  0.9673915  1.0000000  0.9605857
population  0.9968896  0.9605857  1.0000000
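The matrix tools above combine naturally when inspecting covariance structures. The following sketch uses made-up repeated measurements (the object `measures` and its column names are assumptions for illustration) to move between a covariance matrix and a correlation matrix using base R only.

```r
# Illustrative data: three repeated measurements on six units (made-up values)
measures <- cbind(
  t1 = c(10, 12, 11, 14,  9, 13),
  t2 = c(11, 13, 11, 15, 10, 13),
  t3 = c(12, 13, 12, 17, 10, 15)
)

S <- cov(measures)            # 3 x 3 covariance matrix across time points
R <- cov2cor(S)               # rescale the covariance matrix to correlations

isSymmetric(S)                # TRUE
sqrt(diag(S))                 # standard deviation at each time point
all.equal(R, cor(measures))   # TRUE: same result as cor() on the raw data

# Extract the unique off-diagonal correlations
R[upper.tri(R)]
```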

Lists

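Lists are R’s most flexible container: each element can hold any object of any type or length. Many R functions return lists, including fitted-model objects from packages such as plm and fixest (lme4 returns S4 objects instead, which are usually queried through extractor functions such as fixef() and VarCorr()), so you need to know how to navigate them.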

person <- list(
  name    = "Alice",
  age     = 34,
  scores  = c(88, 92, 79),
  active  = TRUE
)

# Access by name
person$name

[1] "Alice"

person[["age"]]

[1] 34

# Access by position
person[[3]]               # third element (the scores vector)

[1] 88 92 79

person[[3]][2]            # second score

[1] 92

# Inspect structure
str(person)

List of 4
 $ name  : chr "Alice"
 $ age   : num 34
 $ scores: num [1:3] 88 92 79
 $ active: logi TRUE

length(person)

[1] 4

names(person)

[1] "name"   "age"    "scores" "active"
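The same navigation tools work on model output. As a sketch, the example below fits a simple linear model with base R's `lm()` on the built-in `mtcars` data (chosen only because it ships with R, not because it is panel data) and pulls components out of the resulting list.

```r
# Fit a simple linear model on a dataset that ships with R
fit <- lm(mpg ~ wt, data = mtcars)

is.list(fit)                 # TRUE: an lm object is a list with a class attribute
names(fit)[1:4]              # the first few components
fit$coefficients             # access a component by name
coef(fit)                    # the extractor function returns the same values

# summary() returns another list with its own components
s <- summary(fit)
s$r.squared
s$coefficients["wt", "Estimate"]
```

Where an extractor function such as `coef()` exists, it is usually the safer choice, because it keeps working if a package changes its internal structure.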

Data Frames

A data frame is the standard rectangular data structure in R: a list of vectors of equal length, where each vector is a column. Think of it as a spreadsheet or database table. Longitudinal datasets are naturally represented as data frames, where each row is an observation for a given unit (individual, country, firm) at a given time point.

df <- data.frame(
  id     = 1:5,
  name   = c("Alice", "Bob", "Clara", "David", "Eva"),
  degree = c(3, 7, 2, 5, 4),
  active = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

df

  id  name degree active
1  1 Alice      3   TRUE
2  2   Bob      7   TRUE
3  3 Clara      2  FALSE
4  4 David      5   TRUE
5  5   Eva      4  FALSE

# Column access
df$name

[1] "Alice" "Bob"   "Clara" "David" "Eva"

df[["degree"]]

[1] 3 7 2 5 4

# Row and column indexing
df[1, ]                   # first row

  id  name degree active
1  1 Alice      3   TRUE

df[, 3]                   # third column

[1] 3 7 2 5 4

df[df$active, ]           # rows where active == TRUE

  id  name degree active
1  1 Alice      3   TRUE
2  2   Bob      7   TRUE
4  4 David      5   TRUE

df[df$degree > 3, c("name", "degree")]

   name degree
2   Bob      7
4 David      5
5   Eva      4

# Useful inspection functions
nrow(df); ncol(df)

[1] 5

[1] 4

head(df, 3)

  id  name degree active
1  1 Alice      3   TRUE
2  2   Bob      7   TRUE
3  3 Clara      2  FALSE

str(df)

'data.frame':   5 obs. of  4 variables:
 $ id    : int  1 2 3 4 5
 $ name  : chr  "Alice" "Bob" "Clara" "David" ...
 $ degree: num  3 7 2 5 4
 $ active: logi  TRUE TRUE FALSE TRUE FALSE

summary(df)

       id        name               degree      active       
 Min.   :1   Length:5           Min.   :2.0   Mode :logical  
 1st Qu.:2   Class :character   1st Qu.:3.0   FALSE:2        
 Median :3   Mode  :character   Median :4.0   TRUE :3        
 Mean   :3                      Mean   :4.2                  
 3rd Qu.:4                      3rd Qu.:5.0                  
 Max.   :5                      Max.   :7.0
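To connect this to the long format described above, here is a small base R sketch of a unit-by-year data frame and a few checks that are often useful before modelling. The data frame `panel` and its values are made up for illustration.

```r
# A toy long-format panel: one row per unit-year (values are made up)
panel <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3),
  year   = c(2019, 2020, 2021, 2019, 2020, 2019, 2020, 2021),
  income = c(40, 41, 43, 65, 66, 33, 34, 36)
)

table(panel$id)                          # observations per unit: 3, 2, 3
length(unique(table(panel$id))) == 1     # FALSE: the panel is unbalanced
anyDuplicated(panel[, c("id", "year")])  # 0: each id-year pair appears once

# Add a derived column and sort by unit, then by time
panel$log_income <- log(panel$income)
panel <- panel[order(panel$id, panel$year), ]
```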

Control Flow

Conditionals

x <- 15

if (x > 10) {
  cat("x is greater than 10\n")
} else if (x == 10) {
  cat("x is exactly 10\n")
} else {
  cat("x is less than 10\n")
}

x is greater than 10

# ifelse() — vectorised conditional
scores <- c(55, 72, 48, 91, 60)
ifelse(scores >= 60, "pass", "fail")

[1] "fail" "pass" "fail" "pass" "pass"
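Two related points, sketched below under the same `scores` vector: `if ()` expects a single TRUE or FALSE, so vector tests need collapsing with `any()` or `all()`, and for more than two categories base R's `cut()` is often easier to read than nested `ifelse()` calls. The cut points and labels are illustrative choices.

```r
scores <- c(55, 72, 48, 91, 60)

# if () expects a single TRUE/FALSE; collapse a vector test with any() or all()
if (any(scores < 50)) {
  cat("At least one score is below 50\n")
}

# For more than two categories, cut() assigns each value to an interval
grade <- cut(scores,
             breaks = c(-Inf, 50, 70, Inf),
             labels = c("low", "middle", "high"),
             right  = FALSE)          # intervals are [a, b): closed on the left
as.character(grade)                   # "middle" "high" "low" "high" "middle"
```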

Loops

In R, explicit loops are often avoidable thanks to vectorisation, but they are useful for iterative tasks and are worth knowing.

# for loop
for (i in 1:5) {
  cat("Iteration:", i, "\n")
}

Iteration: 1 
Iteration: 2 
Iteration: 3 
Iteration: 4 
Iteration: 5

# while loop
count <- 0
while (count < 3) {
  count <- count + 1
  cat("count is", count, "\n")
}

count is 1 
count is 2 
count is 3

# break and next
for (i in 1:10) {
  if (i == 4) next       # skip this iteration
  if (i == 7) break      # exit the loop
  cat(i, "")
}

1 2 3 5 6
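When a loop does store results, a common pattern is to pre-allocate the output vector and iterate with `seq_along()`. The sketch below computes wave-to-wave changes in a made-up series and compares the result with the vectorised `diff()`.

```r
# Change between consecutive observations in a short series (made-up values)
income <- c(40000, 42000, 41000, 45000)

# Pre-allocate the result vector, then fill it inside the loop
change <- numeric(length(income) - 1)
for (i in seq_along(change)) {
  change[i] <- income[i + 1] - income[i]
}
change                     # 2000 -1000 4000

# The vectorised equivalent gives the same result
diff(income)               # 2000 -1000 4000
```

`seq_along(change)` is safer than `1:length(change)`, which would yield `c(1, 0)` if the vector were empty.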

Apply Functions

The apply family offers a concise way to apply a function over a vector, list, or matrix — often replacing explicit loops.

m <- matrix(1:12, nrow = 3)

apply(m, 1, sum)          # row sums

[1] 22 26 30

apply(m, 2, mean)         # column means

[1]  2  5  8 11

nums <- list(a = 1:4, b = 5:10, c = 11:15)
sapply(nums, mean)        # returns a named vector

   a    b    c 
 2.5  7.5 13.0

lapply(nums, length)      # returns a list

$a
[1] 4

$b
[1] 6

$c
[1] 5
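Two further members of the family are worth a brief sketch: `vapply()`, which checks the type and length of each result, and `tapply()`, which applies a function within groups. The `income` and `country` vectors below are made-up illustrations.

```r
nums <- list(a = 1:4, b = 5:10, c = 11:15)

# vapply() is like sapply(), but each result must match the declared type and length
vapply(nums, mean, FUN.VALUE = numeric(1))     # a = 2.5, b = 7.5, c = 13

# tapply() applies a function within groups defined by a second vector
income  <- c(42000, 68000, 35000, 72000)
country <- c("UK", "France", "UK", "Germany")
tapply(income, country, mean)                  # France 68000, Germany 72000, UK 38500
```

Because `vapply()` raises an error when a result has an unexpected shape, it tends to be the more predictable choice inside functions.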

Writing Functions

Functions are the building blocks of reusable code. When you find yourself repeating the same operations, write a function.

# Basic function definition
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}

greet("Alice")

[1] "Hello Alice"

greet("Bob", greeting = "Welcome")

[1] "Welcome Bob"

# A more practical example: normalise a vector to [0, 1]
normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

gdp_growth <- c(1.3, 3.7, 0.2, 4.9, 2.4)
normalise(gdp_growth)

[1] 0.2340426 0.7446809 0.0000000 1.0000000 0.4680851

# Functions can return multiple values via a list
describe <- function(x) {
  list(n = length(x), mean = mean(x), sd = sd(x), range = range(x))
}

describe(gdp_growth)

$n
[1] 5

$mean
[1] 2.5

$sd
[1] 1.866815

$range
[1] 0.2 4.9
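Real data usually contains missing values and occasionally constant columns, both of which break the simple `normalise()` above. One possible defensive variant is sketched here; the name `normalise_safe` and the choice to return 0 for a constant vector are assumptions made for illustration, not a fixed convention.

```r
# A more defensive variant of normalise(); the name and edge-case rules are illustrative
normalise_safe <- function(x) {
  stopifnot(is.numeric(x))                 # fail early on non-numeric input
  rng <- range(x, na.rm = TRUE)            # c(min, max), ignoring missing values
  if (rng[1] == rng[2]) {
    # All observed values are identical: avoid dividing by zero
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

normalise_safe(c(1.3, NA, 0.2, 4.9))       # 0.2340426 NA 0.0000000 1.0000000
normalise_safe(c(5, 5, 5))                 # 0 0 0
```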

Data Wrangling with the Tidyverse

The tidyverse is a collection of R packages designed around a consistent philosophy of tidy data and readable code. The core package for data manipulation is dplyr, which provides a small set of verbs that cover the vast majority of data-wrangling tasks.

library(tidyverse)

The Pipe Operator

The pipe |> (base R, since 4.1) or %>% (magrittr/tidyverse) passes the result of one expression as the first argument of the next. It allows you to write a sequence of transformations in the order they happen, which is much easier to read than nested function calls.

# Without pipe — read inside out
round(mean(c(1, 2, 3, 4, 5)), digits = 2)

[1] 3

# With pipe — read left to right
c(1, 2, 3, 4, 5) |> mean() |> round(digits = 2)

[1] 3
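Because the pipe fills the first argument, functions that take the data elsewhere need a small workaround. The sketch below shows two options with the base pipe, using `lm()` and the built-in `mtcars` data purely as an illustration; the `_` placeholder assumes R 4.2 or later.

```r
# The base pipe inserts the left-hand side as the FIRST argument.
# When the data belongs elsewhere, the `_` placeholder (R >= 4.2)
# can be used with a named argument:
fit <- mtcars |> lm(mpg ~ wt, data = _)

# An anonymous function also works, including in R 4.1
fit2 <- mtcars |> (\(d) lm(mpg ~ wt, data = d))()

all.equal(coef(fit), coef(fit2))     # TRUE
```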

A Working Dataset

We’ll use a small cross-sectional dataset of survey respondents, representative of the kind of individual-level data you might encounter in longitudinal research.

persons <- tibble(
  id         = 1:8,
  name       = c("Alice","Bob","Clara","David","Eva","Frank","Grace","Hugo"),
  country    = c("UK","France","UK","Germany","France","UK","Germany","France"),
  age        = c(32, 45, 28, 51, 38, 23, 47, 35),
  income     = c(42000, 68000, 35000, 72000, 55000, 28000, 61000, 49000)
)

persons

# A tibble: 8 × 5
     id name  country   age income
  <int> <chr> <chr>   <dbl>  <dbl>
1     1 Alice UK         32  42000
2     2 Bob   France     45  68000
3     3 Clara UK         28  35000
4     4 David Germany    51  72000
5     5 Eva   France     38  55000
6     6 Frank UK         23  28000
7     7 Grace Germany    47  61000
8     8 Hugo  France     35  49000

filter() — Keep Rows

# Keep only UK respondents
persons |> filter(country == "UK")

# A tibble: 3 × 5
     id name  country   age income
  <int> <chr> <chr>   <dbl>  <dbl>
1     1 Alice UK         32  42000
2     3 Clara UK         28  35000
3     6 Frank UK         23  28000

# Multiple conditions
persons |> filter(country == "France", age > 40)

# A tibble: 1 × 5
     id name  country   age income
  <int> <chr> <chr>   <dbl>  <dbl>
1     2 Bob   France     45  68000

# Using %in% for multiple values
persons |> filter(country %in% c("UK", "Germany"))

# A tibble: 5 × 5
     id name  country   age income
  <int> <chr> <chr>   <dbl>  <dbl>
1     1 Alice UK         32  42000
2     3 Clara UK         28  35000
3     4 David Germany    51  72000
4     6 Frank UK         23  28000
5     7 Grace Germany    47  61000

select() — Choose Columns

persons |> select(name, country, income)

# A tibble: 8 × 3
  name  country income
  <chr> <chr>    <dbl>
1 Alice UK       42000
2 Bob   France   68000
3 Clara UK       35000
4 David Germany  72000
5 Eva   France   55000
6 Frank UK       28000
7 Grace Germany  61000
8 Hugo  France   49000

# Drop a column with -
persons |> select(-id)

# A tibble: 8 × 4
  name  country   age income
  <chr> <chr>   <dbl>  <dbl>
1 Alice UK         32  42000
2 Bob   France     45  68000
3 Clara UK         28  35000
4 David Germany    51  72000
5 Eva   France     38  55000
6 Frank UK         23  28000
7 Grace Germany    47  61000
8 Hugo  France     35  49000

# Rename while selecting
persons |> select(name, cntry = country, annual_income = income)

# A tibble: 8 × 3
  name  cntry   annual_income
  <chr> <chr>           <dbl>
1 Alice UK              42000
2 Bob   France          68000
3 Clara UK              35000
4 David Germany         72000
5 Eva   France          55000
6 Frank UK              28000
7 Grace Germany         61000
8 Hugo  France          49000

mutate() — Create or Modify Columns

persons |>
  mutate(
    older       = age >= 40,
    income_z    = (income - mean(income)) / sd(income),   # standardise
    label       = paste0(name, " (", country, ")")
  )

# A tibble: 8 × 8
     id name  country   age income older income_z label          
  <int> <chr> <chr>   <dbl>  <dbl> <lgl>    <dbl> <chr>          
1     1 Alice UK         32  42000 FALSE   -0.591 Alice (UK)     
2     2 Bob   France     45  68000 TRUE     1.07  Bob (France)   
3     3 Clara UK         28  35000 FALSE   -1.04  Clara (UK)     
4     4 David Germany    51  72000 TRUE     1.33  David (Germany)
5     5 Eva   France     38  55000 FALSE    0.240 Eva (France)   
6     6 Frank UK         23  28000 FALSE   -1.49  Frank (UK)     
7     7 Grace Germany    47  61000 TRUE     0.623 Grace (Germany)
8     8 Hugo  France     35  49000 FALSE   -0.144 Hugo (France)

arrange() — Sort Rows

persons |> arrange(desc(income))       # highest income first

# A tibble: 8 × 5
     id name  country   age income
  <int> <chr> <chr>   <dbl>  <dbl>
1     4 David Germany    51  72000
2     2 Bob   France     45  68000
3     7 Grace Germany    47  61000
4     5 Eva   France     38  55000
5     8 Hugo  France     35  49000
6     1 Alice UK         32  42000
7     3 Clara UK         28  35000
8     6 Frank UK         23  28000

persons |> arrange(country, age)       # sort by two columns

# A tibble: 8 × 5
     id name  country   age income
  <int> <chr> <chr>   <dbl>  <dbl>
1     8 Hugo  France     35  49000
2     5 Eva   France     38  55000
3     2 Bob   France     45  68000
4     7 Grace Germany    47  61000
5     4 David Germany    51  72000
6     6 Frank UK         23  28000
7     3 Clara UK         28  35000
8     1 Alice UK         32  42000

summarise() and group_by() — Aggregation

# Overall summary
persons |>
  summarise(
    n           = n(),
    mean_income = mean(income),
    sd_income   = sd(income),
    max_age     = max(age)
  )

# A tibble: 1 × 4
      n mean_income sd_income max_age
  <int>       <dbl>     <dbl>   <dbl>
1     8       51250    15655.      51

# Summary by group — useful for comparing units in panel data
persons |>
  group_by(country) |>
  summarise(
    n           = n(),
    mean_income = mean(income),
    mean_age    = mean(age)
  ) |>
  arrange(desc(mean_income))

# A tibble: 3 × 4
  country     n mean_income mean_age
  <chr>   <int>       <dbl>    <dbl>
1 Germany     2      66500      49  
2 France      3      57333.     39.3
3 UK          3      35000      27.7
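In panel work, `group_by()` is just as often paired with `mutate()` as with `summarise()`, because the result keeps one row per observation. The sketch below computes unit means, within-unit deviations, and a one-wave lag on a made-up panel; `toy_panel` and its column names are assumptions for illustration.

```r
library(dplyr)

# Toy long-format panel (values are made up)
toy_panel <- tibble(
  id     = c(1, 1, 1, 2, 2, 2),
  year   = c(2019, 2020, 2021, 2019, 2020, 2021),
  income = c(40, 42, 44, 60, 66, 63)
)

panel_within <- toy_panel |>
  group_by(id) |>
  arrange(year, .by_group = TRUE) |>       # sort by year within each unit
  mutate(
    income_mean   = mean(income),          # unit-specific mean
    income_within = income - income_mean,  # deviation from the unit mean
    income_lag    = lag(income)            # previous wave; NA in each unit's first wave
  ) |>
  ungroup()

panel_within
```

Calling `ungroup()` at the end avoids later operations being silently applied group by group.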

Joining Tables

Joins combine two data frames on a shared key column — essential in longitudinal work when merging panel waves with time-invariant attributes, or linking datasets from different sources.

# Panel waves: repeated observations per person across two years
waves <- tibble(
  id   = c(1, 1, 2, 2, 3, 3, 4, 4),
  year = c(2020, 2021, 2020, 2021, 2020, 2021, 2020, 2021),
  income_obs = c(40000, 42000, 65000, 68000, 33000, 35000, 70000, 72000)
)

# Attach time-invariant person attributes to the panel
waves |>
  left_join(persons |> select(id, name, country),
            by = "id") |>
  arrange(id, year)

# A tibble: 8 × 5
     id  year income_obs name  country
  <dbl> <dbl>      <dbl> <chr> <chr>  
1     1  2020      40000 Alice UK     
2     1  2021      42000 Alice UK     
3     2  2020      65000 Bob   France 
4     2  2021      68000 Bob   France 
5     3  2020      33000 Clara UK     
6     3  2021      35000 Clara UK     
7     4  2020      70000 David Germany
8     4  2021      72000 David Germany
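The join type determines what happens to keys that appear in only one table, which is a frequent source of silently lost or incomplete rows. A minimal sketch with two made-up tables (`obs` and `attrs` are hypothetical names):

```r
library(dplyr)

# Toy tables (values are made up); id 9 has no attributes, id 3 has no observations
obs   <- tibble(id = c(1L, 2L, 9L), income_obs = c(40000, 65000, 51000))
attrs <- tibble(id = c(1L, 2L, 3L), country = c("UK", "France", "UK"))

left_join(obs, attrs, by = "id")    # keeps all 3 rows of obs; country is NA for id 9
inner_join(obs, attrs, by = "id")   # keeps only ids present in both tables (1 and 2)
anti_join(obs, attrs, by = "id")    # rows of obs with no match in attrs (id 9)
```

Running `anti_join()` before a merge is a quick way to list the keys that would end up with missing attributes.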

Reshaping Data

# Wide to long (pivot_longer) — core operation when converting panel data to long format
persons |>
  select(name, age, income) |>
  pivot_longer(cols = c(age, income),
               names_to  = "variable",
               values_to = "value")

# A tibble: 16 × 3
   name  variable value
   <chr> <chr>    <dbl>
 1 Alice age         32
 2 Alice income   42000
 3 Bob   age         45
 4 Bob   income   68000
 5 Clara age         28
 6 Clara income   35000
 7 David age         51
 8 David income   72000
 9 Eva   age         38
10 Eva   income   55000
11 Frank age         23
12 Frank income   28000
13 Grace age         47
14 Grace income   61000
15 Hugo  age         35
16 Hugo  income   49000

# Long to wide (pivot_wider) — useful when reshaping repeated measures to one row per unit
waves |>
  select(id, year, income_obs) |>
  pivot_wider(names_from = year, values_from = income_obs,
              names_prefix = "income_")

# A tibble: 4 × 3
     id income_2020 income_2021
  <dbl>       <dbl>       <dbl>
1     1       40000       42000
2     2       65000       68000
3     3       33000       35000
4     4       70000       72000
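Going the other way, from one column per year back to one row per unit-year, usually also requires recovering the year from the column names. The sketch below uses `names_prefix` and `names_transform` for this; the `wide` table is made up for illustration.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per unit, one income column per year (values are made up)
wide <- tibble(
  id          = c(1, 2),
  income_2020 = c(40000, 65000),
  income_2021 = c(42000, 68000)
)

long <- wide |>
  pivot_longer(
    cols            = starts_with("income_"),
    names_to        = "year",
    names_prefix    = "income_",                 # strip the prefix from the names
    names_transform = list(year = as.integer),   # store year as an integer, not text
    values_to       = "income_obs"
  )

long
```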

Reading and Writing Data

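# CSV files (choose one reader; both lines load the same file)
df <- read_csv("data/panel_data.csv")       # readr, part of the tidyverse (recommended); returns a tibble
df <- read.csv("data/panel_data.csv")       # base R; returns a plain data frame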

write_csv(df, "data/panel_data_clean.csv")

# Excel files
library(readxl)
df <- read_excel("data/data.xlsx", sheet = 1)

# SPSS, Stata, SAS
library(haven)
df <- read_spss("data/survey.sav")
df <- read_dta("data/survey.dta")

# R's native format
saveRDS(df, "data/df.rds")
df <- readRDS("data/df.rds")
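The paths above assume files that exist in your project. As a self-contained sketch of the difference between formats, the example below writes a made-up data frame to temporary files and reads it back; `df_out` and the temporary paths are illustrative only.

```r
# Toy data with a factor column (values are made up)
df_out <- data.frame(
  id      = 1:3,
  country = factor(c("UK", "France", "UK")),
  income  = c(42000, 68000, 35000)
)

# Temporary paths, so nothing is written into the project folder
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(df_out, rds_path)
write.csv(df_out, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

identical(from_rds, df_out)     # TRUE: RDS preserves column types exactly
class(from_csv$country)         # "character": the factor did not survive the CSV round trip
```

CSV is the better choice for sharing data with other software, while RDS is convenient for intermediate results that stay within R.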

Quick Reference

Task	Function
Assign a value	`x <- value`
Create a vector	`c(1, 2, 3)`
Create a sequence	`1:10`, `seq(0, 1, 0.1)`
Create a matrix	`matrix(data, nrow, ncol)`
Create a data frame	`data.frame()` / `tibble()`
Check an object’s type	`class()`, `typeof()`
Check dimensions	`dim()`, `nrow()`, `ncol()`, `length()`
Inspect structure	`str()`, `summary()`, `head()`
Filter rows	`filter()`
Select columns	`select()`
Create columns	`mutate()`
Sort rows	`arrange()`
Aggregate	`group_by()` + `summarise()`
Join tables	`left_join()`, `inner_join()`
Reshape wide → long	`pivot_longer()`
Reshape long → wide	`pivot_wider()`