4  Working with Panel Data in R

Structure, Reshaping, Indexing, and Missingness

Author
Published

April 16, 2026

Before estimating any model, we need to understand the two fundamental shapes that panel data can take, know how to move between them, and be able to detect and address the most common data quality problems. This chapter covers all three.


Two Shapes of Panel Data

Panel data — repeated observations on the same units over time — can be stored in two distinct formats. Understanding the difference is essential before you can do anything else.

Person-level (wide) format

In wide format, each row represents one unit (a person, firm, or country) and each wave of measurement occupies a separate column. This is how many surveys distribute their data and how non-specialists tend to organise spreadsheets.

wide_example <- tibble(
  id     = 1:4,
  gender = c("F", "M", "F", "M"),
  wage_2020 = c(2.1, 3.4, 1.8, 2.9),
  wage_2021 = c(2.3, 3.6, 1.9, 3.1),
  wage_2022 = c(2.4, 3.5, 2.2, 3.3),
  educ_2020 = c(12, 16, 14, 12),
  educ_2021 = c(12, 16, 16, 12),
  educ_2022 = c(14, 16, 16, 14)
)

wide_example
# A tibble: 4 × 8
     id gender wage_2020 wage_2021 wage_2022 educ_2020 educ_2021
  <int> <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1 F            2.1       2.3       2.4        12        12
2     2 M            3.4       3.6       3.5        16        16
3     3 F            1.8       1.9       2.2        14        16
4     4 M            2.9       3.1       3.3        12        12
# ℹ 1 more variable: educ_2022 <dbl>

Each person appears on exactly one row. Repeated measurements are spread across columns named by variable and wave (wage_2020, wage_2021, …). This makes it easy to compute within-person changes (subtract column A from column B), but it is awkward for modelling: most statistical functions in R expect one observation per row.

Person-period (long) format

In long format, each row is one observation — one unit at one point in time. A person observed at four time points occupies four rows.

long_example <- wide_example |>
  pivot_longer(
    cols      = starts_with("wage_") | starts_with("educ_"),
    names_to  = c(".value", "year"),
    names_sep = "_"
  ) |>
  mutate(year = as.integer(year))

long_example
# A tibble: 12 × 5
      id gender  year  wage  educ
   <int> <chr>  <int> <dbl> <dbl>
 1     1 F       2020   2.1    12
 2     1 F       2021   2.3    12
 3     1 F       2022   2.4    14
 4     2 M       2020   3.4    16
 5     2 M       2021   3.6    16
 6     2 M       2022   3.5    16
 7     3 F       2020   1.8    14
 8     3 F       2021   1.9    16
 9     3 F       2022   2.2    16
10     4 M       2020   2.9    12
11     4 M       2021   3.1    12
12     4 M       2022   3.3    14

This is the format required by virtually all panel model packages in R (plm, fixest, lme4). Each row has a unit identifier (id), a time identifier (year), and then the variables measured at that wave.

NoteWhich format does your data arrive in?

Survey data is often wide: one row per respondent, with wave-specific suffixes. Administrative records are usually long: each event or measurement is a separate record. Knowing which you have determines your first step.


Reading Panel Data into R

CSV and tabular data

The workshop dataset — 200 individuals tracked annually from 2016 to 2023 — is stored as a CSV:

panel_data <- read_csv("files/panel_data.csv", show_col_types = FALSE)

glimpse(panel_data)
Rows: 1,600
Columns: 11
$ id         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…
$ year       <dbl> 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023,…
$ period     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8,…
$ gender     <chr> "Female", "Female", "Female", "Female", "Female…
$ sector     <chr> "Public", "Public", "Public", "Public", "Public…
$ education  <dbl> 10, 10, 11, 11, 11, 12, 12, 12, 11, 11, 12, 12,…
$ experience <dbl> 6, 7, 8, 9, 10, 11, 12, 13, 6, 7, 8, 9, 10, 11,…
$ union      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
$ ability    <dbl> -2.0009292, -2.0009292, -2.0009292, -2.0009292,…
$ wage       <dbl> 5.883, 8.100, 7.670, 6.753, 7.688, 10.231, NA, …
$ log_wage   <dbl> 1.772119, 2.091868, 2.037348, 1.909919, 2.03972…

The dataset is already in long format: each row is one person-year. The key identifiers are id (person) and year (time).

Stata and SPSS files

Many longitudinal datasets are distributed in Stata (.dta) or SPSS (.sav) format. The haven package reads both directly, preserving value labels:

library(haven)

# Stata
df_stata <- read_dta("your_data.dta")

# SPSS
df_spss  <- read_sav("your_data.sav")

# Convert labelled columns to factors
df_stata <- df_stata |> mutate(across(where(is.labelled), as_factor))

Excel files

library(readxl)

df_xl <- read_excel("your_data.xlsx", sheet = "Panel", na = c("", "NA", "."))

Inspecting the Panel Structure

Before modelling, always verify that your data is what you think it is.

How many units? How many periods?

cat("Individuals :", n_distinct(panel_data$id),   "\n")
Individuals : 200 
cat("Time periods:", n_distinct(panel_data$year),  "\n")
Time periods: 8 
cat("Total rows  :", nrow(panel_data),             "\n")
Total rows  : 1600 
cat("Year range  :", min(panel_data$year), "–", max(panel_data$year), "\n")
Year range  : 2016 – 2023 

Is the panel balanced?

A balanced panel has the same number of time points for every unit. An unbalanced panel does not — some units are observed less frequently. The distinction matters for estimation.

obs_per_person <- panel_data |>
  count(id, name = "n_obs")

obs_per_person |>
  count(n_obs, name = "n_individuals") |>
  kable(
    col.names = c("Observations per person", "Number of individuals"),
    caption   = "Table 1 — Panel balance check"
  )
Table 1 — Panel balance check
Observations per person Number of individuals
8 200
ggplot(obs_per_person, aes(x = n_obs)) +
  geom_bar(fill = "#045a8d", width = 0.6) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Observations per individual", y = "Count",
       title = "A balanced panel: every individual has exactly 8 observations") +
  theme_minimal(base_size = 13)

Figure 1 — Number of observations per individual (balanced panel)

Unit and time coverage at a glance

For small panels it can be useful to visualise which units are observed at which times — a “swimmer plot” or tile chart that reveals gaps immediately.

set.seed(7)
sample_ids <- sample(unique(panel_data$id), 30)

panel_data |>
  filter(id %in% sample_ids) |>
  mutate(id = factor(id, levels = rev(sort(sample_ids)))) |>
  ggplot(aes(x = year, y = id)) +
  geom_tile(fill = "#045a8d", colour = "white", linewidth = 0.3) +
  scale_x_continuous(breaks = 2016:2023) +
  labs(x = "Year", y = "Individual ID",
       title = "Observation grid — filled = observed, gap = missing") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7),
        panel.grid  = element_blank())

Figure 2 — Observation coverage for a random sample of 30 individuals

Cleaning Panel Data

Standardise variable names

The janitor package’s clean_names() converts all column names to snake_case, removes special characters, and ensures uniqueness. Get into the habit of calling it immediately after reading data.

# Requires: install.packages("janitor")
library(janitor)

# Simulate messy column names from an external file
messy <- tibble(
  `Person ID` = 1:3,
  `Survey Year` = 2020:2022,
  `Log Wage (£)` = c(7.2, 7.4, 7.3),
  `Union Member?` = c(1, 0, 1)
)

messy |> clean_names()

Parse and handle date variables

Many longitudinal datasets record time as a date string rather than an integer year. lubridate provides a consistent grammar for parsing and computing with dates.

date_example <- tibble(
  id           = c(1, 1, 2, 2),
  interview_dt = c("2020-03-15", "2022-09-01", "2020-06-22", "2022-11-30")
)

date_example <- date_example |>
  mutate(
    interview_dt = ymd(interview_dt),           # parse to Date
    year         = year(interview_dt),          # extract year
    month        = month(interview_dt, label = TRUE),
    wave_gap_days = as.integer(
      interview_dt - lag(interview_dt), units = "days"
    )
  )

date_example
# A tibble: 4 × 5
     id interview_dt  year month wave_gap_days
  <dbl> <date>       <dbl> <ord>         <int>
1     1 2020-03-15    2020 Mar               0
2     1 2022-09-01    2022 Sep               0
3     2 2020-06-22    2020 Jun               0
4     2 2022-11-30    2022 Nov               0

Recode and relabel variables

panel_data <- panel_data |>
  mutate(
    gender    = factor(gender, levels = c("Male", "Female")),
    sector    = factor(sector, levels = c("Public", "Private")),
    union_lbl = if_else(union == 1, "Union member", "Non-member")
  )

panel_data |>
  count(gender, sector) |>
  kable(caption = "Table 2 — Cross-tabulation: gender × sector")
Table 2 — Cross-tabulation: gender × sector
gender sector n
Male Public 224
Male Private 528
Female Public 240
Female Private 608

Reshaping: Wide ↔︎ Long

Long to wide with pivot_wider

Sometimes you need wide format — for example to compute person-level change scores or to merge with a cross-sectional dataset.

wages_wide <- panel_data |>
  select(id, year, wage) |>
  pivot_wider(
    names_from  = year,
    values_from = wage,
    names_prefix = "wage_"
  )

head(wages_wide, 5)
# A tibble: 5 × 9
     id wage_2016 wage_2017 wage_2018 wage_2019 wage_2020 wage_2021
  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1      5.88      8.1       7.67      6.75      7.69     10.2 
2     2     17.2      19.0      13.8      18.3      17.1      23.2 
3     3     25.5      20.2      21.6      NA        27.4      32.5 
4     4     33.4      31.6      38.9      42.2      40.4      49.1 
5     5      6.36      7.32      8.17     10.2       6.18      6.91
# ℹ 2 more variables: wage_2022 <dbl>, wage_2023 <dbl>

You can now compute within-person change between any two waves:

wages_wide <- wages_wide |>
  mutate(wage_change_16_23 = wage_2023 - wage_2016)

wages_wide |>
  select(id, wage_2016, wage_2023, wage_change_16_23) |>
  slice_head(n = 6) |>
  kable(
    col.names = c("ID", "Wage 2016", "Wage 2023", "Change 2016→2023"),
    caption   = "Table 3 — Within-person wage change (wide format)"
  )
Table 3 — Within-person wage change (wide format)
ID Wage 2016 Wage 2023 Change 2016→2023
1 5.883 8.301 2.418
2 17.217 18.898 1.681
3 25.461 26.789 1.328
4 33.405 46.068 12.663
5 6.365 9.512 3.147
6 9.270 8.428 -0.842

Wide to long with pivot_longer

To go back to long format, or when your data arrives wide:

wages_long_again <- wages_wide |>
  select(id, starts_with("wage_20")) |>      # drop the change score
  pivot_longer(
    cols         = starts_with("wage_20"),
    names_to     = "year",
    names_prefix = "wage_",
    values_to    = "wage"
  ) |>
  mutate(year = as.integer(year))

head(wages_long_again, 8)
# A tibble: 8 × 3
     id  year  wage
  <dbl> <int> <dbl>
1     1  2016  5.88
2     1  2017  8.1 
3     1  2018  7.67
4     1  2019  6.75
5     1  2020  7.69
6     1  2021 10.2 
7     1  2022 NA   
8     1  2023  8.30

Reshaping multiple variables simultaneously

When several variables are wave-stamped (e.g., wage_2020, educ_2020, wage_2021, educ_2021), use the ".value" sentinel in names_to to split the column name into variable name and wave:

# Reconstruct a multi-variable wide dataset to demonstrate
multi_wide <- panel_data |>
  select(id, year, wage, education) |>
  filter(id <= 3) |>
  pivot_wider(
    names_from  = year,
    values_from = c(wage, education)
  )

# Pivot back to long, splitting "wage_2016" into variable="wage", year="2016"
multi_wide |>
  pivot_longer(
    cols      = -id,
    names_to  = c(".value", "year"),
    names_sep = "_(?=\\d{4}$)"           # split on _ before a 4-digit year
  ) |>
  mutate(year = as.integer(year)) |>
  arrange(id, year)
# A tibble: 24 × 4
      id  year  wage education
   <dbl> <int> <dbl>     <dbl>
 1     1  2016  5.88        10
 2     1  2017  8.1         10
 3     1  2018  7.67        11
 4     1  2019  6.75        11
 5     1  2020  7.69        11
 6     1  2021 10.2         12
 7     1  2022 NA           12
 8     1  2023  8.30        12
 9     2  2016 17.2         11
10     2  2017 19.0         11
# ℹ 14 more rows

Indexing: Creating a pdata.frame

The plm package works with panel data frames — objects that know which columns identify units and time. Creating one is the gateway to all of plm’s modelling and diagnostic functions.

pdata <- pdata.frame(panel_data, index = c("id", "year"))

# The class is both a pdata.frame and a data.frame
class(pdata)
[1] "pdata.frame" "data.frame" 

Inspecting the panel index

pdim(pdata)
Balanced Panel: n = 200, T = 8, N = 1600

pdim() reports the panel dimensions and whether it is balanced. The T here is the number of time periods, N the number of individuals.

Within and between variation

A central diagnostic in panel analysis is how much variation in your key variables is within units (across time) versus between units (across individuals). plm::Within() and between() compute these directly.

within_between <- panel_data |>
  summarise(
    across(
      c(wage, education, experience),
      list(
        total_sd  = \(x) round(sd(x, na.rm = TRUE), 3),
        between_sd = \(x) round(sd(tapply(x, id, mean, na.rm = TRUE),
                                  na.rm = TRUE), 3),
        within_sd  = \(x) round(mean(tapply(x, id,
                                  \(v) sd(v, na.rm = TRUE),
                                  simplify = TRUE), na.rm = TRUE), 3)
      )
    )
  ) |>
  pivot_longer(everything(), names_to = c("Variable", ".value"),
               names_sep = "_(?=total|between|within)") |>
  rename(`Total SD` = total_sd,
         `Between SD` = between_sd,
         `Within SD` = within_sd)

kable(within_between,
      caption = "Table 4 — Decomposition of variance: total, between, and within")
Table 4 — Decomposition of variance: total, between, and within
Variable Total SD Between SD Within SD
wage 9.933 9.338 3.355
education 2.495 2.376 0.830
experience 4.779 4.203 2.449
TipReading the variance decomposition
  • Between SD — how much the person-average of a variable varies across individuals. This is the variation exploited by cross-sectional (OLS) models.
  • Within SD — how much a variable varies within the same individual over time. This is the variation used by the Fixed Effects estimator.

A variable with low within-SD and high between-SD (e.g., gender) carries little information for a within-estimator. A variable with high within-SD (e.g., experience) is well-suited to fixed effects analysis.

within_between |>
  pivot_longer(c(`Between SD`, `Within SD`),
               names_to = "Component", values_to = "SD") |>
  ggplot(aes(x = Variable, y = SD, fill = Component)) +
  geom_col(position = "dodge", width = 0.6) +
  scale_fill_manual(values = c("Between SD" = "#045a8d",
                               "Within SD"  = "#74add1")) +
  labs(x = NULL, y = "Standard deviation", fill = NULL,
       title = "Within vs between variation",
       subtitle = "FE uses within; OLS mixes both") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Figure 3 — Within vs between standard deviation for key variables

Handling Missingness

Real panel datasets almost always have missing values. The source and pattern of missingness determines how seriously it threatens your estimates.

Types of missingness

Type Definition Implication
MCAR Missing completely at random — missingness is unrelated to any variable Listwise deletion is unbiased (just less efficient)
MAR Missing at random — missingness depends on observed variables Multiple imputation or FIML can recover unbiased estimates
MNAR Missing not at random — missingness depends on the unobserved value itself Requires a selection model; listwise deletion is biased

Detecting missingness

missing_summary <- panel_data |>
  summarise(across(everything(),
                   \(x) sum(is.na(x)))) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "N missing") |>
  mutate(`% missing` = round(`N missing` / nrow(panel_data) * 100, 2)) |>
  filter(`N missing` > 0)

kable(missing_summary,
      caption = "Table 5 — Missing values by variable")
Table 5 — Missing values by variable
Variable N missing % missing
wage 48 3
log_wage 48 3

Is missingness systematic?

The first question to ask: is missingness in the outcome related to observed predictors? If it is, listwise deletion may introduce bias.

panel_data |>
  arrange(id, year) |>
  group_by(id) |>
  mutate(
    missing_next_wage = is.na(dplyr::lead(wage))
  ) |>
  ungroup() |>
  filter(!is.na(missing_next_wage)) |>
  group_by(year, missing_next_wage) |>
  summarise(mean_wage = mean(wage, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = year, y = mean_wage,
             colour = missing_next_wage, group = missing_next_wage)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 3) +
  scale_colour_manual(values = c("FALSE" = "#045a8d", "TRUE" = "#d73027"),
                      labels  = c("Wage observed next year",
                                  "Wage missing next year")) +
  labs(x = NULL, y = "Mean wage (£'000s/month)", colour = NULL,
       title = "Do lower earners drop out more often?",
       subtitle = "Red = individuals who are missing the following year") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Figure 4 — Mean wage by year and missingness in the following year

Listwise deletion (complete cases)

The default in most R model functions. Observations with any NA on the modelled variables are dropped. This is valid only under MCAR.

panel_complete <- panel_data |> drop_na(wage, education, experience, union)

cat("Rows before:", nrow(panel_data),    "\n")
Rows before: 1600 
cat("Rows after :", nrow(panel_complete), "\n")
Rows after : 1552 
cat("Dropped    :", nrow(panel_data) - nrow(panel_complete), "\n")
Dropped    : 48 

Carrying observations forward (fill)

In some panel datasets, a variable is recorded only when it changes (e.g., a diagnosis or employment status). tidyr::fill() propagates the last observed value forward (or backward) within each unit.

fill_example <- tibble(
  id     = c(1, 1, 1, 1, 2, 2, 2, 2),
  year   = rep(2020:2023, 2),
  status = c("Employed", NA, NA, "Unemployed",
             "Student",  NA, "Employed", NA)
)

fill_example |>
  group_by(id) |>
  fill(status, .direction = "down") |>
  ungroup()
# A tibble: 8 × 3
     id  year status    
  <dbl> <int> <chr>     
1     1  2020 Employed  
2     1  2021 Employed  
3     1  2022 Employed  
4     1  2023 Unemployed
5     2  2020 Student   
6     2  2021 Student   
7     2  2022 Employed  
8     2  2023 Employed  

Checking panel balance after dropping missing rows

Dropping rows due to missingness can unbalance a previously balanced panel. Always recheck after listwise deletion:

pdata_complete <- pdata.frame(panel_complete, index = c("id", "year"))
pdim(pdata_complete)
Unbalanced Panel: n = 200, T = 6-8, N = 1552
WarningUnbalanced panels and fixed effects

An unbalanced panel is not invalid — plm and fixest both handle it correctly. But it is worth understanding why balance was lost: if high-wage or low-wage individuals are disproportionately missing, your within-estimator still operates on a selected sample.

Flagging and summarising incomplete spells

NoteWhat is a “spell”?

In the context of longitudinal data and panel studies, a spell refers to a continuous period of time that an individual spends in a particular state — employed, unemployed, enrolled in education, living in a given region, and so on.

Think of it as a single “chapter” or “streak” in someone’s data: a spell begins when the individual enters a state and ends when they leave it. A person may have multiple spells of the same type over the observation window (e.g., two separate employment spells separated by a period of unemployment).

In the code below, a complete spell means an individual is observed in every time period in the dataset — their “chapter” has no missing pages.

incomplete <- panel_data |>
  group_by(id) |>
  summarise(
    n_obs   = n(),
    n_miss  = sum(is.na(wage)),
    complete = n_obs == max(panel_data |>
                 count(id) |> pull(n))
  ) |>
  ungroup()

incomplete |>
  count(complete) |>
  mutate(label = if_else(complete, "Complete spells", "Incomplete spells")) |>
  kable(col.names = c("Complete", "N individuals", "Label"),
        caption = "Table 6 — Complete vs incomplete individual spells")
Table 6 — Complete vs incomplete individual spells
Complete N individuals Label
TRUE 200 Complete spells

Summary

Task Key function(s)
Read CSV / Stata / SPSS / Excel read_csv(), read_dta(), read_sav(), read_excel()
Standardise column names janitor::clean_names()
Parse dates lubridate::ymd(), year(), month()
Wide → long tidyr::pivot_longer()
Long → wide tidyr::pivot_wider()
Check balance plm::pdim()
Detect missing is.na(), drop_na(), naniar::gg_miss_var()
Carry values forward tidyr::fill()
Panel index plm::pdata.frame(df, index = c("id", "time"))
Within/between split plm::Within(), plm::between()
ImportantBefore fitting any model, always verify
  1. Your data is in long format (one row per unit × time).
  2. The unit and time identifiers are correct and uniquely index each row.
  3. The panel is balanced — or you understand why it isn’t.
  4. You know the direction and magnitude of missingness in your outcome.
  5. You have decomposed key variables into within and between components to understand what variation your chosen estimator will exploit.

Common Panel Data Sources

Panel and longitudinal datasets come from a wide range of sources. Below is a selection of well-known survey-based panels and process-generated datasets with a longitudinal dimension.

Household and individual panel surveys

These are purpose-designed surveys that track the same individuals or households over many waves:

  • PSID — Panel Study of Income Dynamics (USA); one of the longest-running household panel surveys, covering income, wealth, employment, and health since 1968.
  • NLSY79 — National Longitudinal Survey of Youth 1979 cohort (USA); tracks labour market experiences, education, and family formation.
  • Understanding Society — UK Household Longitudinal Study; follows ~40,000 households annually on a broad range of social and economic outcomes.
  • Millennium Cohort Study — Follows ~19,000 children born in the UK in 2000–01 through childhood, adolescence, and into adulthood.
  • SOEP — Socio-Economic Panel Study (Germany); annual household survey running since 1984, covering income, employment, health, and well-being.
  • pairfam — Panel Analysis of Intimate Relationships and Family Dynamics (Germany); focuses on partnership, fertility, and intergenerational relations.
  • SHARE — Survey of Health, Ageing and Retirement in Europe; multi-country panel of individuals aged 50+.

Process-generated longitudinal data

Administrative records and monitoring systems often produce data with a natural longitudinal structure, even though they were not designed as surveys:

  • Our World in Data — COVID-19 — Country-day panel of cases, deaths, vaccinations, and policy responses; well-documented and regularly updated.
  • UK Census (Nomis) — Area-level statistics from the decennial UK Census; geographic units can be tracked across census years.
  • London Air Quality Network — Station-level time series of pollutant concentrations across London.
  • Copernicus Tree Cover Density — Raster-based estimates of tree cover density across Europe; repeated status maps enable change analysis over time.

Practical considerations

Merging across waves. Many panel datasets are distributed as a separate file per wave (each file is cross-sectional), which must be stacked or merged to create a panel. This is a common and time-consuming step. Some data providers or the research community have published pre-written merge scripts to simplify the process — see for example the Understanding Society syntax library or the SOEP teaching materials.

Panel attrition. Participation in surveys tends to fall over successive waves. Individuals drop out for a variety of reasons — non-contact, refusal, migration, or death — and those who remain may be systematically different from those who leave. If attrition is related to the outcome of interest, estimates based on the surviving sample can be biased. Always check how many units are observed across all waves and consider whether attrition looks random.