4 Working with Panel Data in R

Structure, Reshaping, Indexing, and Missingness

Author

Published

April 16, 2026

Before estimating any model, we need to understand the two fundamental shapes that panel data can take, know how to move between them, and be able to detect and address the most common data quality problems. This chapter covers all three.

Two Shapes of Panel Data

Panel data — repeated observations on the same units over time — can be stored in two distinct formats. Understanding the difference is essential before you can do anything else.

Person-level (wide) format

In wide format, each row represents one unit (a person, firm, or country) and each wave of measurement occupies a separate column. This is how many surveys distribute their data and how non-specialists tend to organise spreadsheets.

wide_example <- tibble(
  id     = 1:4,
  gender = c("F", "M", "F", "M"),
  wage_2020 = c(2.1, 3.4, 1.8, 2.9),
  wage_2021 = c(2.3, 3.6, 1.9, 3.1),
  wage_2022 = c(2.4, 3.5, 2.2, 3.3),
  educ_2020 = c(12, 16, 14, 12),
  educ_2021 = c(12, 16, 16, 12),
  educ_2022 = c(14, 16, 16, 14)
)

wide_example

# A tibble: 4 × 8
     id gender wage_2020 wage_2021 wage_2022 educ_2020 educ_2021
  <int> <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1 F            2.1       2.3       2.4        12        12
2     2 M            3.4       3.6       3.5        16        16
3     3 F            1.8       1.9       2.2        14        16
4     4 M            2.9       3.1       3.3        12        12
# ℹ 1 more variable: educ_2022 <dbl>

Each person appears on exactly one row. Repeated measurements are spread across columns named by variable and wave (wage_2020, wage_2021, …). This makes it easy to compute within-person changes (subtract column A from column B), but it is awkward for modelling: most statistical functions in R expect one observation per row.

Person-period (long) format

In long format, each row is one observation — one unit at one point in time. A person observed at four time points occupies four rows.

long_example <- wide_example |>
  pivot_longer(
    cols      = starts_with("wage_") | starts_with("educ_"),
    names_to  = c(".value", "year"),
    names_sep = "_"
  ) |>
  mutate(year = as.integer(year))

long_example

# A tibble: 12 × 5
      id gender  year  wage  educ
   <int> <chr>  <int> <dbl> <dbl>
 1     1 F       2020   2.1    12
 2     1 F       2021   2.3    12
 3     1 F       2022   2.4    14
 4     2 M       2020   3.4    16
 5     2 M       2021   3.6    16
 6     2 M       2022   3.5    16
 7     3 F       2020   1.8    14
 8     3 F       2021   1.9    16
 9     3 F       2022   2.2    16
10     4 M       2020   2.9    12
11     4 M       2021   3.1    12
12     4 M       2022   3.3    14

This is the format required by virtually all panel model packages in R (plm, fixest, lme4). Each row has a unit identifier (id), a time identifier (year), and then the variables measured at that wave.

Which format does your data arrive in?

Survey data is often wide: one row per respondent, with wave-specific suffixes. Administrative records are usually long: each event or measurement is a separate record. Knowing which you have determines your first step.

Reading Panel Data into R

CSV and tabular data

The workshop dataset — 200 individuals tracked annually from 2016 to 2023 — is stored as a CSV:

panel_data <- read_csv("files/panel_data.csv", show_col_types = FALSE)

glimpse(panel_data)

Rows: 1,600
Columns: 11
$ id         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…
$ year       <dbl> 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023,…
$ period     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8,…
$ gender     <chr> "Female", "Female", "Female", "Female", "Female…
$ sector     <chr> "Public", "Public", "Public", "Public", "Public…
$ education  <dbl> 10, 10, 11, 11, 11, 12, 12, 12, 11, 11, 12, 12,…
$ experience <dbl> 6, 7, 8, 9, 10, 11, 12, 13, 6, 7, 8, 9, 10, 11,…
$ union      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
$ ability    <dbl> -2.0009292, -2.0009292, -2.0009292, -2.0009292,…
$ wage       <dbl> 5.883, 8.100, 7.670, 6.753, 7.688, 10.231, NA, …
$ log_wage   <dbl> 1.772119, 2.091868, 2.037348, 1.909919, 2.03972…

The dataset is already in long format: each row is one person-year. The key identifiers are id (person) and year (time).

Stata and SPSS files

Many longitudinal datasets are distributed in Stata (.dta) or SPSS (.sav) format. The haven package reads both directly, preserving value labels:

library(haven)

# Stata
df_stata <- read_dta("your_data.dta")

# SPSS
df_spss  <- read_sav("your_data.sav")

# Convert labelled columns to factors
df_stata <- df_stata |> mutate(across(where(is.labelled), as_factor))

Excel files

library(readxl)

df_xl <- read_excel("your_data.xlsx", sheet = "Panel", na = c("", "NA", "."))

Inspecting the Panel Structure

Before modelling, always verify that your data is what you think it is.

How many units? How many periods?

cat("Individuals :", n_distinct(panel_data$id),   "\n")

Individuals : 200

cat("Time periods:", n_distinct(panel_data$year),  "\n")

Time periods: 8

cat("Total rows  :", nrow(panel_data),             "\n")

Total rows  : 1600

cat("Year range  :", min(panel_data$year), "–", max(panel_data$year), "\n")

Year range  : 2016 – 2023

Is the panel balanced?

A balanced panel has the same number of time points for every unit. An unbalanced panel does not — some units are observed less frequently. The distinction matters for estimation.

obs_per_person <- panel_data |>
  count(id, name = "n_obs")

obs_per_person |>
  count(n_obs, name = "n_individuals") |>
  kable(
    col.names = c("Observations per person", "Number of individuals"),
    caption   = "Table 1 — Panel balance check"
  )

Table 1 — Panel balance check
Observations per person	Number of individuals
8	200

ggplot(obs_per_person, aes(x = n_obs)) +
  geom_bar(fill = "#045a8d", width = 0.6) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Observations per individual", y = "Count",
       title = "A balanced panel: every individual has exactly 8 observations") +
  theme_minimal(base_size = 13)

Figure 1 — Number of observations per individual (balanced panel)

Unit and time coverage at a glance

For small panels it can be useful to visualise which units are observed at which times — a “swimmer plot” or tile chart that reveals gaps immediately.

set.seed(7)
sample_ids <- sample(unique(panel_data$id), 30)

panel_data |>
  filter(id %in% sample_ids) |>
  mutate(id = factor(id, levels = rev(sort(sample_ids)))) |>
  ggplot(aes(x = year, y = id)) +
  geom_tile(fill = "#045a8d", colour = "white", linewidth = 0.3) +
  scale_x_continuous(breaks = 2016:2023) +
  labs(x = "Year", y = "Individual ID",
       title = "Observation grid — filled = observed, gap = missing") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7),
        panel.grid  = element_blank())

Figure 2 — Observation coverage for a random sample of 30 individuals

Cleaning Panel Data

Standardise variable names

The janitor package’s clean_names() converts all column names to snake_case, removes special characters, and ensures uniqueness. Get into the habit of calling it immediately after reading data.

# Requires: install.packages("janitor")
library(janitor)

# Simulate messy column names from an external file
messy <- tibble(
  `Person ID` = 1:3,
  `Survey Year` = 2020:2022,
  `Log Wage (£)` = c(7.2, 7.4, 7.3),
  `Union Member?` = c(1, 0, 1)
)

messy |> clean_names()

Parse and handle date variables

Many longitudinal datasets record time as a date string rather than an integer year. lubridate provides a consistent grammar for parsing and computing with dates.

date_example <- tibble(
  id           = c(1, 1, 2, 2),
  interview_dt = c("2020-03-15", "2022-09-01", "2020-06-22", "2022-11-30")
)

date_example <- date_example |>
  mutate(
    interview_dt = ymd(interview_dt),           # parse to Date
    year         = year(interview_dt),          # extract year
    month        = month(interview_dt, label = TRUE),
    wave_gap_days = as.integer(
      interview_dt - lag(interview_dt), units = "days"
    )
  )

date_example

# A tibble: 4 × 5
     id interview_dt  year month wave_gap_days
  <dbl> <date>       <dbl> <ord>         <int>
1     1 2020-03-15    2020 Mar               0
2     1 2022-09-01    2022 Sep               0
3     2 2020-06-22    2020 Jun               0
4     2 2022-11-30    2022 Nov               0

Recode and relabel variables

panel_data <- panel_data |>
  mutate(
    gender    = factor(gender, levels = c("Male", "Female")),
    sector    = factor(sector, levels = c("Public", "Private")),
    union_lbl = if_else(union == 1, "Union member", "Non-member")
  )

panel_data |>
  count(gender, sector) |>
  kable(caption = "Table 2 — Cross-tabulation: gender × sector")

Table 2 — Cross-tabulation: gender × sector
gender	sector	n
Male	Public	224
Male	Private	528
Female	Public	240
Female	Private	608

Reshaping: Wide ↔︎ Long

Long to wide with `pivot_wider`

Sometimes you need wide format — for example to compute person-level change scores or to merge with a cross-sectional dataset.

wages_wide <- panel_data |>
  select(id, year, wage) |>
  pivot_wider(
    names_from  = year,
    values_from = wage,
    names_prefix = "wage_"
  )

head(wages_wide, 5)

# A tibble: 5 × 9
     id wage_2016 wage_2017 wage_2018 wage_2019 wage_2020 wage_2021
  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1      5.88      8.1       7.67      6.75      7.69     10.2 
2     2     17.2      19.0      13.8      18.3      17.1      23.2 
3     3     25.5      20.2      21.6      NA        27.4      32.5 
4     4     33.4      31.6      38.9      42.2      40.4      49.1 
5     5      6.36      7.32      8.17     10.2       6.18      6.91
# ℹ 2 more variables: wage_2022 <dbl>, wage_2023 <dbl>

You can now compute within-person change between any two waves:

wages_wide <- wages_wide |>
  mutate(wage_change_16_23 = wage_2023 - wage_2016)

wages_wide |>
  select(id, wage_2016, wage_2023, wage_change_16_23) |>
  slice_head(n = 6) |>
  kable(
    col.names = c("ID", "Wage 2016", "Wage 2023", "Change 2016→2023"),
    caption   = "Table 3 — Within-person wage change (wide format)"
  )

Table 3 — Within-person wage change (wide format)
ID	Wage 2016	Wage 2023	Change 2016→2023
1	5.883	8.301	2.418
2	17.217	18.898	1.681
3	25.461	26.789	1.328
4	33.405	46.068	12.663
5	6.365	9.512	3.147
6	9.270	8.428	-0.842

Wide to long with `pivot_longer`

To go back to long format, or when your data arrives wide:

wages_long_again <- wages_wide |>
  select(id, starts_with("wage_20")) |>      # drop the change score
  pivot_longer(
    cols         = starts_with("wage_20"),
    names_to     = "year",
    names_prefix = "wage_",
    values_to    = "wage"
  ) |>
  mutate(year = as.integer(year))

head(wages_long_again, 8)

# A tibble: 8 × 3
     id  year  wage
  <dbl> <int> <dbl>
1     1  2016  5.88
2     1  2017  8.1 
3     1  2018  7.67
4     1  2019  6.75
5     1  2020  7.69
6     1  2021 10.2 
7     1  2022 NA   
8     1  2023  8.30

Reshaping multiple variables simultaneously

When several variables are wave-stamped (e.g., wage_2020, educ_2020, wage_2021, educ_2021), use the ".value" sentinel in names_to to split the column name into variable name and wave:

# Reconstruct a multi-variable wide dataset to demonstrate
multi_wide <- panel_data |>
  select(id, year, wage, education) |>
  filter(id <= 3) |>
  pivot_wider(
    names_from  = year,
    values_from = c(wage, education)
  )

# Pivot back to long, splitting "wage_2016" into variable="wage", year="2016"
multi_wide |>
  pivot_longer(
    cols      = -id,
    names_to  = c(".value", "year"),
    names_sep = "_(?=\\d{4}$)"           # split on _ before a 4-digit year
  ) |>
  mutate(year = as.integer(year)) |>
  arrange(id, year)

# A tibble: 24 × 4
      id  year  wage education
   <dbl> <int> <dbl>     <dbl>
 1     1  2016  5.88        10
 2     1  2017  8.1         10
 3     1  2018  7.67        11
 4     1  2019  6.75        11
 5     1  2020  7.69        11
 6     1  2021 10.2         12
 7     1  2022 NA           12
 8     1  2023  8.30        12
 9     2  2016 17.2         11
10     2  2017 19.0         11
# ℹ 14 more rows

Indexing: Creating a `pdata.frame`

The plm package works with panel data frames — objects that know which columns identify units and time. Creating one is the gateway to all of plm’s modelling and diagnostic functions.

pdata <- pdata.frame(panel_data, index = c("id", "year"))

# The class is both a pdata.frame and a data.frame
class(pdata)

[1] "pdata.frame" "data.frame"

Inspecting the panel index

pdim(pdata)

Balanced Panel: n = 200, T = 8, N = 1600

pdim() reports the panel dimensions and whether it is balanced. The T here is the number of time periods, N the number of individuals.

Within and between variation

A central diagnostic in panel analysis is how much variation in your key variables is within units (across time) versus between units (across individuals). plm::Within() and between() compute these directly.

within_between <- panel_data |>
  summarise(
    across(
      c(wage, education, experience),
      list(
        total_sd  = \(x) round(sd(x, na.rm = TRUE), 3),
        between_sd = \(x) round(sd(tapply(x, id, mean, na.rm = TRUE),
                                  na.rm = TRUE), 3),
        within_sd  = \(x) round(mean(tapply(x, id,
                                  \(v) sd(v, na.rm = TRUE),
                                  simplify = TRUE), na.rm = TRUE), 3)
      )
    )
  ) |>
  pivot_longer(everything(), names_to = c("Variable", ".value"),
               names_sep = "_(?=total|between|within)") |>
  rename(`Total SD` = total_sd,
         `Between SD` = between_sd,
         `Within SD` = within_sd)

kable(within_between,
      caption = "Table 4 — Decomposition of variance: total, between, and within")

Table 4 — Decomposition of variance: total, between, and within
Variable	Total SD	Between SD	Within SD
wage	9.933	9.338	3.355
education	2.495	2.376	0.830
experience	4.779	4.203	2.449

Reading the variance decomposition

Between SD — how much the person-average of a variable varies across individuals. This is the variation exploited by cross-sectional (OLS) models.
Within SD — how much a variable varies within the same individual over time. This is the variation used by the Fixed Effects estimator.

A variable with low within-SD and high between-SD (e.g., gender) carries little information for a within-estimator. A variable with high within-SD (e.g., experience) is well-suited to fixed effects analysis.

within_between |>
  pivot_longer(c(`Between SD`, `Within SD`),
               names_to = "Component", values_to = "SD") |>
  ggplot(aes(x = Variable, y = SD, fill = Component)) +
  geom_col(position = "dodge", width = 0.6) +
  scale_fill_manual(values = c("Between SD" = "#045a8d",
                               "Within SD"  = "#74add1")) +
  labs(x = NULL, y = "Standard deviation", fill = NULL,
       title = "Within vs between variation",
       subtitle = "FE uses within; OLS mixes both") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Figure 3 — Within vs between standard deviation for key variables

Handling Missingness

Real panel datasets almost always have missing values. The source and pattern of missingness determines how seriously it threatens your estimates.

Types of missingness

Type	Definition	Implication
MCAR	Missing completely at random — missingness is unrelated to any variable	Listwise deletion is unbiased (just less efficient)
MAR	Missing at random — missingness depends on observed variables	Multiple imputation or FIML can recover unbiased estimates
MNAR	Missing not at random — missingness depends on the unobserved value itself	Requires a selection model; listwise deletion is biased

Detecting missingness

missing_summary <- panel_data |>
  summarise(across(everything(),
                   \(x) sum(is.na(x)))) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "N missing") |>
  mutate(`% missing` = round(`N missing` / nrow(panel_data) * 100, 2)) |>
  filter(`N missing` > 0)

kable(missing_summary,
      caption = "Table 5 — Missing values by variable")

Table 5 — Missing values by variable
Variable	N missing	% missing
wage	48	3
log_wage	48	3

Is missingness systematic?

The first question to ask: is missingness in the outcome related to observed predictors? If it is, listwise deletion may introduce bias.

panel_data |>
  arrange(id, year) |>
  group_by(id) |>
  mutate(
    missing_next_wage = is.na(dplyr::lead(wage))
  ) |>
  ungroup() |>
  filter(!is.na(missing_next_wage)) |>
  group_by(year, missing_next_wage) |>
  summarise(mean_wage = mean(wage, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = year, y = mean_wage,
             colour = missing_next_wage, group = missing_next_wage)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 3) +
  scale_colour_manual(values = c("FALSE" = "#045a8d", "TRUE" = "#d73027"),
                      labels  = c("Wage observed next year",
                                  "Wage missing next year")) +
  labs(x = NULL, y = "Mean wage (£'000s/month)", colour = NULL,
       title = "Do lower earners drop out more often?",
       subtitle = "Red = individuals who are missing the following year") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Figure 4 — Mean wage by year and missingness in the following year

Listwise deletion (complete cases)

The default in most R model functions. Observations with any NA on the modelled variables are dropped. This is valid only under MCAR.

panel_complete <- panel_data |> drop_na(wage, education, experience, union)

cat("Rows before:", nrow(panel_data),    "\n")

Rows before: 1600

cat("Rows after :", nrow(panel_complete), "\n")

Rows after : 1552

cat("Dropped    :", nrow(panel_data) - nrow(panel_complete), "\n")

Dropped    : 48

Carrying observations forward (`fill`)

In some panel datasets, a variable is recorded only when it changes (e.g., a diagnosis or employment status). tidyr::fill() propagates the last observed value forward (or backward) within each unit.

fill_example <- tibble(
  id     = c(1, 1, 1, 1, 2, 2, 2, 2),
  year   = rep(2020:2023, 2),
  status = c("Employed", NA, NA, "Unemployed",
             "Student",  NA, "Employed", NA)
)

fill_example |>
  group_by(id) |>
  fill(status, .direction = "down") |>
  ungroup()

# A tibble: 8 × 3
     id  year status    
  <dbl> <int> <chr>     
1     1  2020 Employed  
2     1  2021 Employed  
3     1  2022 Employed  
4     1  2023 Unemployed
5     2  2020 Student   
6     2  2021 Student   
7     2  2022 Employed  
8     2  2023 Employed

Checking panel balance after dropping missing rows

Dropping rows due to missingness can unbalance a previously balanced panel. Always recheck after listwise deletion:

pdata_complete <- pdata.frame(panel_complete, index = c("id", "year"))
pdim(pdata_complete)

Unbalanced Panel: n = 200, T = 6-8, N = 1552

Unbalanced panels and fixed effects

An unbalanced panel is not invalid — plm and fixest both handle it correctly. But it is worth understanding why balance was lost: if high-wage or low-wage individuals are disproportionately missing, your within-estimator still operates on a selected sample.

Flagging and summarising incomplete spells

What is a “spell”?

In the context of longitudinal data and panel studies, a spell refers to a continuous period of time that an individual spends in a particular state — employed, unemployed, enrolled in education, living in a given region, and so on.

Think of it as a single “chapter” or “streak” in someone’s data: a spell begins when the individual enters a state and ends when they leave it. A person may have multiple spells of the same type over the observation window (e.g., two separate employment spells separated by a period of unemployment).

In the code below, a complete spell means an individual is observed in every time period in the dataset — their “chapter” has no missing pages.

incomplete <- panel_data |>
  group_by(id) |>
  summarise(
    n_obs   = n(),
    n_miss  = sum(is.na(wage)),
    complete = n_obs == max(panel_data |>
                 count(id) |> pull(n))
  ) |>
  ungroup()

incomplete |>
  count(complete) |>
  mutate(label = if_else(complete, "Complete spells", "Incomplete spells")) |>
  kable(col.names = c("Complete", "N individuals", "Label"),
        caption = "Table 6 — Complete vs incomplete individual spells")

Table 6 — Complete vs incomplete individual spells
Complete	N individuals	Label
TRUE	200	Complete spells

Summary

Task	Key function(s)
Read CSV / Stata / SPSS / Excel	`read_csv()`, `read_dta()`, `read_sav()`, `read_excel()`
Standardise column names	`janitor::clean_names()`
Parse dates	`lubridate::ymd()`, `year()`, `month()`
Wide → long	`tidyr::pivot_longer()`
Long → wide	`tidyr::pivot_wider()`
Check balance	`plm::pdim()`
Detect missing	`is.na()`, `drop_na()`, `naniar::gg_miss_var()`
Carry values forward	`tidyr::fill()`
Panel index	`plm::pdata.frame(df, index = c("id", "time"))`
Within/between split	`plm::Within()`, `plm::between()`

Before fitting any model, always verify

Your data is in long format (one row per unit × time).
The unit and time identifiers are correct and uniquely index each row.
The panel is balanced — or you understand why it isn’t.
You know the direction and magnitude of missingness in your outcome.
You have decomposed key variables into within and between components to understand what variation your chosen estimator will exploit.

Common Panel Data Sources

Panel and longitudinal datasets come from a wide range of sources. Below is a selection of well-known survey-based panels and process-generated datasets with a longitudinal dimension.

Household and individual panel surveys

These are purpose-designed surveys that track the same individuals or households over many waves:

PSID — Panel Study of Income Dynamics (USA); one of the longest-running household panel surveys, covering income, wealth, employment, and health since 1968.
NLSY79 — National Longitudinal Survey of Youth 1979 cohort (USA); tracks labour market experiences, education, and family formation.
Understanding Society — UK Household Longitudinal Study; follows ~40,000 households annually on a broad range of social and economic outcomes.
Millennium Cohort Study — Follows ~19,000 children born in the UK in 2000–01 through childhood, adolescence, and into adulthood.
SOEP — Socio-Economic Panel Study (Germany); annual household survey running since 1984, covering income, employment, health, and well-being.
pairfam — Panel Analysis of Intimate Relationships and Family Dynamics (Germany); focuses on partnership, fertility, and intergenerational relations.
SHARE — Survey of Health, Ageing and Retirement in Europe; multi-country panel of individuals aged 50+.

Process-generated longitudinal data

Administrative records and monitoring systems often produce data with a natural longitudinal structure, even though they were not designed as surveys:

Our World in Data — COVID-19 — Country-day panel of cases, deaths, vaccinations, and policy responses; well-documented and regularly updated.
UK Census (Nomis) — Area-level statistics from the decennial UK Census; geographic units can be tracked across census years.
London Air Quality Network — Station-level time series of pollutant concentrations across London.
Copernicus Tree Cover Density — Raster-based estimates of tree cover density across Europe; repeated status maps enable change analysis over time.

Practical considerations

Merging across waves. Many panel datasets are distributed as a separate file per wave (each file is cross-sectional), which must be stacked or merged to create a panel. This is a common and time-consuming step. Some data providers or the research community have published pre-written merge scripts to simplify the process — see for example the Understanding Society syntax library or the SOEP teaching materials.

Panel attrition. Participation in surveys tends to fall over successive waves. Individuals drop out for a variety of reasons — non-contact, refusal, migration, or death — and those who remain may be systematically different from those who leave. If attrition is related to the outcome of interest, estimates based on the surviving sample can be biased. Always check how many units are observed across all waves and consider whether attrition looks random.

--- title: "Working with Panel Data in R" subtitle: "Structure, Reshaping, Indexing, and Missingness" author: "[Dr Clemens Jarnach](https://clemensjarnach.github.io)" affiliation: "Oxford Martin School, University of Oxford" date: "04-16-2026" format: html: toc: true toc-depth: 2 code-fold: false theme: cosmo self-contained: false execute: warning: false message: false --- ```{r} #| label: setup #| include: false if (!file.exists("files/panel_data.csv")) { source("simple_panel_data.r") } library(tidyverse) library(plm) library(lubridate) library(knitr) ``` Before estimating any model, we need to understand the two fundamental shapes that panel data can take, know how to move between them, and be able to detect and address the most common data quality problems. This chapter covers all three. ------------------------------------------------------------------------ ## Two Shapes of Panel Data Panel data — repeated observations on the same units over time — can be stored in two distinct formats. Understanding the difference is essential before you can do anything else. ### Person-level (wide) format In **wide** format, each row represents one unit (a person, firm, or country) and each wave of measurement occupies a separate column. This is how many surveys distribute their data and how non-specialists tend to organise spreadsheets. ```{r} #| label: wide-example wide_example <- tibble( id = 1:4, gender = c("F", "M", "F", "M"), wage_2020 = c(2.1, 3.4, 1.8, 2.9), wage_2021 = c(2.3, 3.6, 1.9, 3.1), wage_2022 = c(2.4, 3.5, 2.2, 3.3), educ_2020 = c(12, 16, 14, 12), educ_2021 = c(12, 16, 16, 12), educ_2022 = c(14, 16, 16, 14) ) wide_example ``` Each person appears on exactly **one row**. Repeated measurements are spread across columns named by variable and wave (`wage_2020`, `wage_2021`, …). This makes it easy to compute within-person changes (subtract column A from column B), but it is awkward for modelling: most statistical functions in R expect one observation per row. ### Person-period (long) format In **long** format, each row is one observation — one unit at one point in time. A person observed at four time points occupies four rows. ```{r} #| label: long-example long_example <- wide_example |> pivot_longer( cols = starts_with("wage_") | starts_with("educ_"), names_to = c(".value", "year"), names_sep = "_" ) |> mutate(year = as.integer(year)) long_example ``` This is the format required by virtually all panel model packages in R (`plm`, `fixest`, `lme4`). Each row has a unit identifier (`id`), a time identifier (`year`), and then the variables measured at that wave. ::: callout-note ### Which format does your data arrive in? Survey data is often wide: one row per respondent, with wave-specific suffixes. Administrative records are usually long: each event or measurement is a separate record. Knowing which you have determines your first step. ::: ------------------------------------------------------------------------ ## Reading Panel Data into R ### CSV and tabular data The workshop dataset — 200 individuals tracked annually from 2016 to 2023 — is stored as a CSV: ```{r} #| label: read-data panel_data <- read_csv("files/panel_data.csv", show_col_types = FALSE) glimpse(panel_data) ``` The dataset is already in **long format**: each row is one person-year. The key identifiers are `id` (person) and `year` (time). ### Stata and SPSS files Many longitudinal datasets are distributed in Stata (`.dta`) or SPSS (`.sav`) format. The `haven` package reads both directly, preserving value labels: ```{r} #| label: haven-example #| eval: false library(haven) # Stata df_stata <- read_dta("your_data.dta") # SPSS df_spss <- read_sav("your_data.sav") # Convert labelled columns to factors df_stata <- df_stata |> mutate(across(where(is.labelled), as_factor)) ``` ### Excel files ```{r} #| label: excel-example #| eval: false library(readxl) df_xl <- read_excel("your_data.xlsx", sheet = "Panel", na = c("", "NA", ".")) ``` ------------------------------------------------------------------------ ## Inspecting the Panel Structure Before modelling, always verify that your data is what you think it is. ### How many units? How many periods? ```{r} #| label: structure-check cat("Individuals :", n_distinct(panel_data$id), "\n") cat("Time periods:", n_distinct(panel_data$year), "\n") cat("Total rows :", nrow(panel_data), "\n") cat("Year range :", min(panel_data$year), "–", max(panel_data$year), "\n") ``` ### Is the panel balanced? A **balanced** panel has the same number of time points for every unit. An **unbalanced** panel does not — some units are observed less frequently. The distinction matters for estimation. ```{r} #| label: balance-check obs_per_person <- panel_data |> count(id, name = "n_obs") obs_per_person |> count(n_obs, name = "n_individuals") |> kable( col.names = c("Observations per person", "Number of individuals"), caption = "Table 1 — Panel balance check" ) ``` ```{r} #| label: balance-plot #| fig-cap: "Figure 1 — Number of observations per individual (balanced panel)" #| fig-height: 3.5 ggplot(obs_per_person, aes(x = n_obs)) + geom_bar(fill = "#045a8d", width = 0.6) + scale_x_continuous(breaks = 1:10) + labs(x = "Observations per individual", y = "Count", title = "A balanced panel: every individual has exactly 8 observations") + theme_minimal(base_size = 13) ``` ### Unit and time coverage at a glance For small panels it can be useful to visualise which units are observed at which times — a "swimmer plot" or tile chart that reveals gaps immediately. ```{r} #| label: coverage-plot #| fig-cap: "Figure 2 — Observation coverage for a random sample of 30 individuals" #| fig-height: 5 set.seed(7) sample_ids <- sample(unique(panel_data$id), 30) panel_data |> filter(id %in% sample_ids) |> mutate(id = factor(id, levels = rev(sort(sample_ids)))) |> ggplot(aes(x = year, y = id)) + geom_tile(fill = "#045a8d", colour = "white", linewidth = 0.3) + scale_x_continuous(breaks = 2016:2023) + labs(x = "Year", y = "Individual ID", title = "Observation grid — filled = observed, gap = missing") + theme_minimal(base_size = 11) + theme(axis.text.y = element_text(size = 7), panel.grid = element_blank()) ``` ------------------------------------------------------------------------ ## Cleaning Panel Data ### Standardise variable names The `janitor` package's `clean_names()` converts all column names to `snake_case`, removes special characters, and ensures uniqueness. Get into the habit of calling it immediately after reading data. ```{r} #| label: clean-names #| eval: false # Requires: install.packages("janitor") library(janitor) # Simulate messy column names from an external file messy <- tibble( `Person ID` = 1:3, `Survey Year` = 2020:2022, `Log Wage (£)` = c(7.2, 7.4, 7.3), `Union Member?` = c(1, 0, 1) ) messy |> clean_names() ``` ### Parse and handle date variables Many longitudinal datasets record time as a date string rather than an integer year. `lubridate` provides a consistent grammar for parsing and computing with dates. ```{r} #| label: lubridate-example date_example <- tibble( id = c(1, 1, 2, 2), interview_dt = c("2020-03-15", "2022-09-01", "2020-06-22", "2022-11-30") ) date_example <- date_example |> mutate( interview_dt = ymd(interview_dt), # parse to Date year = year(interview_dt), # extract year month = month(interview_dt, label = TRUE), wave_gap_days = as.integer( interview_dt - lag(interview_dt), units = "days" ) ) date_example ``` ### Recode and relabel variables ```{r} #| label: recode panel_data <- panel_data |> mutate( gender = factor(gender, levels = c("Male", "Female")), sector = factor(sector, levels = c("Public", "Private")), union_lbl = if_else(union == 1, "Union member", "Non-member") ) panel_data |> count(gender, sector) |> kable(caption = "Table 2 — Cross-tabulation: gender × sector") ``` ------------------------------------------------------------------------ ## Reshaping: Wide ↔ Long ### Long to wide with `pivot_wider` Sometimes you need wide format — for example to compute person-level change scores or to merge with a cross-sectional dataset. ```{r} #| label: pivot-wider wages_wide <- panel_data |> select(id, year, wage) |> pivot_wider( names_from = year, values_from = wage, names_prefix = "wage_" ) head(wages_wide, 5) ``` You can now compute within-person change between any two waves: ```{r} #| label: change-scores wages_wide <- wages_wide |> mutate(wage_change_16_23 = wage_2023 - wage_2016) wages_wide |> select(id, wage_2016, wage_2023, wage_change_16_23) |> slice_head(n = 6) |> kable( col.names = c("ID", "Wage 2016", "Wage 2023", "Change 2016→2023"), caption = "Table 3 — Within-person wage change (wide format)" ) ``` ### Wide to long with `pivot_longer` To go back to long format, or when your data arrives wide: ```{r} #| label: pivot-longer wages_long_again <- wages_wide |> select(id, starts_with("wage_20")) |> # drop the change score pivot_longer( cols = starts_with("wage_20"), names_to = "year", names_prefix = "wage_", values_to = "wage" ) |> mutate(year = as.integer(year)) head(wages_long_again, 8) ``` ### Reshaping multiple variables simultaneously When several variables are wave-stamped (e.g., `wage_2020`, `educ_2020`, `wage_2021`, `educ_2021`), use the `".value"` sentinel in `names_to` to split the column name into variable name and wave: ```{r} #| label: pivot-multi-var # Reconstruct a multi-variable wide dataset to demonstrate multi_wide <- panel_data |> select(id, year, wage, education) |> filter(id <= 3) |> pivot_wider( names_from = year, values_from = c(wage, education) ) # Pivot back to long, splitting "wage_2016" into variable="wage", year="2016" multi_wide |> pivot_longer( cols = -id, names_to = c(".value", "year"), names_sep = "_(?=\\d{4}$)" # split on _ before a 4-digit year ) |> mutate(year = as.integer(year)) |> arrange(id, year) ``` ------------------------------------------------------------------------ ## Indexing: Creating a `pdata.frame` The `plm` package works with **panel data frames** — objects that know which columns identify units and time. Creating one is the gateway to all of `plm`'s modelling and diagnostic functions. ```{r} #| label: pdata-frame pdata <- pdata.frame(panel_data, index = c("id", "year")) # The class is both a pdata.frame and a data.frame class(pdata) ``` ### Inspecting the panel index ```{r} #| label: pdim pdim(pdata) ``` `pdim()` reports the panel dimensions and whether it is balanced. The `T` here is the number of time periods, `N` the number of individuals. ### Within and between variation A central diagnostic in panel analysis is how much variation in your key variables is *within* units (across time) versus *between* units (across individuals). `plm::Within()` and `between()` compute these directly. ```{r} #| label: within-between within_between <- panel_data |> summarise( across( c(wage, education, experience), list( total_sd = \(x) round(sd(x, na.rm = TRUE), 3), between_sd = \(x) round(sd(tapply(x, id, mean, na.rm = TRUE), na.rm = TRUE), 3), within_sd = \(x) round(mean(tapply(x, id, \(v) sd(v, na.rm = TRUE), simplify = TRUE), na.rm = TRUE), 3) ) ) ) |> pivot_longer(everything(), names_to = c("Variable", ".value"), names_sep = "_(?=total|between|within)") |> rename(`Total SD` = total_sd, `Between SD` = between_sd, `Within SD` = within_sd) kable(within_between, caption = "Table 4 — Decomposition of variance: total, between, and within") ``` ::: callout-tip ### Reading the variance decomposition - **Between SD** — how much the *person-average* of a variable varies across individuals. This is the variation exploited by cross-sectional (OLS) models. - **Within SD** — how much a variable varies *within* the same individual over time. This is the variation used by the Fixed Effects estimator. A variable with low within-SD and high between-SD (e.g., gender) carries little information for a within-estimator. A variable with high within-SD (e.g., experience) is well-suited to fixed effects analysis. ::: ```{r} #| label: within-between-plot #| fig-cap: "Figure 3 — Within vs between standard deviation for key variables" #| fig-height: 3.5 within_between |> pivot_longer(c(`Between SD`, `Within SD`), names_to = "Component", values_to = "SD") |> ggplot(aes(x = Variable, y = SD, fill = Component)) + geom_col(position = "dodge", width = 0.6) + scale_fill_manual(values = c("Between SD" = "#045a8d", "Within SD" = "#74add1")) + labs(x = NULL, y = "Standard deviation", fill = NULL, title = "Within vs between variation", subtitle = "FE uses within; OLS mixes both") + theme_minimal(base_size = 13) + theme(legend.position = "top") ``` ------------------------------------------------------------------------ ## Handling Missingness Real panel datasets almost always have missing values. The source and pattern of missingness determines how seriously it threatens your estimates. ### Types of missingness | Type | Definition | Implication | |----|----|----| | **MCAR** | Missing completely at random — missingness is unrelated to any variable | Listwise deletion is unbiased (just less efficient) | | **MAR** | Missing at random — missingness depends on *observed* variables | Multiple imputation or FIML can recover unbiased estimates | | **MNAR** | Missing not at random — missingness depends on the *unobserved* value itself | Requires a selection model; listwise deletion is biased | ### Detecting missingness ```{r} #| label: missing-summary missing_summary <- panel_data |> summarise(across(everything(), \(x) sum(is.na(x)))) |> pivot_longer(everything(), names_to = "Variable", values_to = "N missing") |> mutate(`% missing` = round(`N missing` / nrow(panel_data) * 100, 2)) |> filter(`N missing` > 0) kable(missing_summary, caption = "Table 5 — Missing values by variable") ``` ### Is missingness systematic? {#sec-missing-systematic} The first question to ask: is missingness in the outcome related to observed predictors? If it is, listwise deletion may introduce bias. ```{r} #| label: missing-pattern #| fig-cap: "Figure 4 — Mean wage by year and missingness in the following year" #| fig-height: 4 panel_data |> arrange(id, year) |> group_by(id) |> mutate( missing_next_wage = is.na(dplyr::lead(wage)) ) |> ungroup() |> filter(!is.na(missing_next_wage)) |> group_by(year, missing_next_wage) |> summarise(mean_wage = mean(wage, na.rm = TRUE), .groups = "drop") |> ggplot(aes(x = year, y = mean_wage, colour = missing_next_wage, group = missing_next_wage)) + geom_line(linewidth = 1.1) + geom_point(size = 3) + scale_colour_manual(values = c("FALSE" = "#045a8d", "TRUE" = "#d73027"), labels = c("Wage observed next year", "Wage missing next year")) + labs(x = NULL, y = "Mean wage (£'000s/month)", colour = NULL, title = "Do lower earners drop out more often?", subtitle = "Red = individuals who are missing the following year") + theme_minimal(base_size = 13) + theme(legend.position = "bottom") ``` ### Listwise deletion (complete cases) The default in most R model functions. Observations with any `NA` on the modelled variables are dropped. This is valid only under MCAR. ```{r} #| label: complete-cases panel_complete <- panel_data |> drop_na(wage, education, experience, union) cat("Rows before:", nrow(panel_data), "\n") cat("Rows after :", nrow(panel_complete), "\n") cat("Dropped :", nrow(panel_data) - nrow(panel_complete), "\n") ``` ### Carrying observations forward (`fill`) In some panel datasets, a variable is recorded only when it changes (e.g., a diagnosis or employment status). `tidyr::fill()` propagates the last observed value forward (or backward) within each unit. ```{r} #| label: fill-example fill_example <- tibble( id = c(1, 1, 1, 1, 2, 2, 2, 2), year = rep(2020:2023, 2), status = c("Employed", NA, NA, "Unemployed", "Student", NA, "Employed", NA) ) fill_example |> group_by(id) |> fill(status, .direction = "down") |> ungroup() ``` ### Checking panel balance after dropping missing rows Dropping rows due to missingness can **unbalance** a previously balanced panel. Always recheck after listwise deletion: ```{r} #| label: recheck-balance pdata_complete <- pdata.frame(panel_complete, index = c("id", "year")) pdim(pdata_complete) ``` ::: callout-warning ### Unbalanced panels and fixed effects An unbalanced panel is not invalid — `plm` and `fixest` both handle it correctly. But it is worth understanding *why* balance was lost: if high-wage or low-wage individuals are disproportionately missing, your within-estimator still operates on a selected sample. ::: ### Flagging and summarising incomplete spells ::: callout-note ### What is a "spell"? In the context of longitudinal data and panel studies, a **spell** refers to a continuous period of time that an individual spends in a particular state — employed, unemployed, enrolled in education, living in a given region, and so on. Think of it as a single "chapter" or "streak" in someone's data: a spell begins when the individual enters a state and ends when they leave it. A person may have multiple spells of the same type over the observation window (e.g., two separate employment spells separated by a period of unemployment). In the code below, a *complete spell* means an individual is observed in every time period in the dataset — their "chapter" has no missing pages. ::: ```{r} #| label: incomplete-spells incomplete <- panel_data |> group_by(id) |> summarise( n_obs = n(), n_miss = sum(is.na(wage)), complete = n_obs == max(panel_data |> count(id) |> pull(n)) ) |> ungroup() incomplete |> count(complete) |> mutate(label = if_else(complete, "Complete spells", "Incomplete spells")) |> kable(col.names = c("Complete", "N individuals", "Label"), caption = "Table 6 — Complete vs incomplete individual spells") ``` ------------------------------------------------------------------------ ## Summary | Task | Key function(s) | |----|----| | Read CSV / Stata / SPSS / Excel | `read_csv()`, `read_dta()`, `read_sav()`, `read_excel()` | | Standardise column names | `janitor::clean_names()` | | Parse dates | `lubridate::ymd()`, `year()`, `month()` | | Wide → long | `tidyr::pivot_longer()` | | Long → wide | `tidyr::pivot_wider()` | | Check balance | `plm::pdim()` | | Detect missing | `is.na()`, `drop_na()`, `naniar::gg_miss_var()` | | Carry values forward | `tidyr::fill()` | | Panel index | `plm::pdata.frame(df, index = c("id", "time"))` | | Within/between split | `plm::Within()`, `plm::between()` | ::: callout-important ### Before fitting any model, always verify 1. Your data is in **long format** (one row per unit × time). 2. The unit and time **identifiers are correct** and uniquely index each row. 3. The panel is **balanced** — or you understand why it isn't. 4. You know the **direction and magnitude of missingness** in your outcome. 5. You have decomposed key variables into **within and between components** to understand what variation your chosen estimator will exploit. ::: ------------------------------------------------------------------------ ## Common Panel Data Sources Panel and longitudinal datasets come from a wide range of sources. Below is a selection of well-known survey-based panels and process-generated datasets with a longitudinal dimension. ### Household and individual panel surveys These are purpose-designed surveys that track the same individuals or households over many waves: - [**PSID**](https://psidonline.isr.umich.edu/) — Panel Study of Income Dynamics (USA); one of the longest-running household panel surveys, covering income, wealth, employment, and health since 1968. - [**NLSY79**](https://www.nlsinfo.org/content/cohorts/nlsy79) — National Longitudinal Survey of Youth 1979 cohort (USA); tracks labour market experiences, education, and family formation. - [**Understanding Society**](https://www.understandingsociety.ac.uk/) — UK Household Longitudinal Study; follows \~40,000 households annually on a broad range of social and economic outcomes. - [**Millennium Cohort Study**](https://cls.ucl.ac.uk/cls-studies/millennium-cohort-study/) — Follows \~19,000 children born in the UK in 2000–01 through childhood, adolescence, and into adulthood. - [**SOEP**](https://www.diw.de/en/diw_01.c.615551.en/research_infrastructure__socio-economic_panel__soep.html) — Socio-Economic Panel Study (Germany); annual household survey running since 1984, covering income, employment, health, and well-being. - [**pairfam**](https://www.pairfam.de/) — Panel Analysis of Intimate Relationships and Family Dynamics (Germany); focuses on partnership, fertility, and intergenerational relations. - [**SHARE**](http://www.share-project.org/home0.html) — Survey of Health, Ageing and Retirement in Europe; multi-country panel of individuals aged 50+. ### Process-generated longitudinal data Administrative records and monitoring systems often produce data with a natural longitudinal structure, even though they were not designed as surveys: - [**Our World in Data — COVID-19**](https://github.com/owid/covid-19-data) — Country-day panel of cases, deaths, vaccinations, and policy responses; well-documented and regularly updated. - [**UK Census (Nomis)**](https://www.nomisweb.co.uk/sources/census) — Area-level statistics from the decennial UK Census; geographic units can be tracked across census years. - [**London Air Quality Network**](https://data.london.gov.uk/air-quality/) — Station-level time series of pollutant concentrations across London. - [**Copernicus Tree Cover Density**](https://land.copernicus.eu/pan-european/high-resolution-layers/forests/tree-cover-density/status-maps) — Raster-based estimates of tree cover density across Europe; repeated status maps enable change analysis over time. ### Practical considerations **Merging across waves.** Many panel datasets are distributed as a separate file per wave (each file is cross-sectional), which must be stacked or merged to create a panel. This is a common and time-consuming step. Some data providers or the research community have published pre-written merge scripts to simplify the process — see for example the [Understanding Society syntax library](https://www.understandingsociety.ac.uk/documentation/mainstage/syntax) or the [SOEP teaching materials](https://www.ls3.soziologie.uni-muenchen.de/studium-lehre/lehrmaterialien/index.html). **Panel attrition.** Participation in surveys tends to fall over successive waves. Individuals drop out for a variety of reasons — non-contact, refusal, migration, or death — and those who remain may be systematically different from those who leave. If attrition is related to the outcome of interest, estimates based on the surviving sample can be biased. Always check how many units are observed across all waves and consider whether attrition looks random.