Load packages
library(tidyverse)
library(plm)
library(knitr)
library(RColorBrewer)
# Colour palette (Oxford blue + accents)
pal_farms <- brewer.pal(5, "Set2")Discovering Unobserved Heterogeneity Through Farm Data
library(tidyverse)
library(plm)
library(knitr)
library(RColorBrewer)
# Colour palette (Oxford blue + accents)
pal_farms <- brewer.pal(5, "Set2")You are economists hired by a Ministry of Agriculture.
You have been given data on five farms. Your task is to answer one policy question:
Does increasing the number of workers raise farm productivity (output)?
Keep it simple for now: more workers → more output … right?
Below is a snapshot of the five farms in a single year.
set.seed(42)
# Unobserved farm quality (alpha_i) — higher-quality farms also hire more workers
quality <- c(75, 70, 63, 52, 35) # unobserved: A is best, E is worst
farms <- LETTERS[1:5]
labour_cs <- c(10, 8, 6, 4, 2) # correlated with quality!
true_beta <- 2.5 # true effect of one extra worker
# Output = quality + beta * labour + small noise
set.seed(42)
output_cs <- quality + true_beta * labour_cs + rnorm(5, 0, 1.5)
output_cs <- round(output_cs)
df_cs <- tibble(
Farm = farms,
Labour = labour_cs,
Output = output_cs
)kable(
df_cs,
col.names = c("Farm", "Labour (workers)", "Output (tonnes)"),
align = "ccc",
caption = "Table 1 — Cross-sectional snapshot (Year 2)"
)| Farm | Labour (workers) | Output (tonnes) |
|---|---|---|
| A | 10 | 102 |
| B | 8 | 89 |
| C | 6 | 79 |
| D | 4 | 63 |
| E | 2 | 41 |
ols_cs <- lm(Output ~ Labour, data = df_cs)
beta_cs <- round(coef(ols_cs)["Labour"], 2)
ggplot(df_cs, aes(x = Labour, y = Output, colour = Farm, label = Farm)) +
geom_smooth(aes(group = 1), method = "lm", se = TRUE,
colour = "grey40", fill = "grey85", linewidth = 0.8) +
geom_point(size = 5) +
geom_text(nudge_y = 1.8, fontface = "bold", size = 4.5) +
scale_colour_brewer(palette = "Set2") +
annotate("text", x = 8, y = 68,
label = paste0("OLS slope = ", beta_cs, " tonnes / worker"),
colour = "grey30", size = 4, hjust = 0) +
labs(x = "Labour (workers)", y = "Output (tonnes)",
title = "Cross-sectional relationship: Labour → Output") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")But wait — should we trust this number?
“Are all farms equally good, independent of how many workers they hire?”
Take 30 seconds to discuss with your neighbour. What might be driving the pattern in Figure 1?
df_cs_full <- df_cs |>
mutate(Quality = quality,
Quality_label = c("Very high", "High", "Medium", "Low", "Very low"))
kable(
df_cs_full |> select(Farm, Labour, Quality_label, Output),
col.names = c("Farm", "Labour (workers)", "Soil Quality (hidden!)", "Output (tonnes)"),
align = "cccc",
caption = "Table 2 — The hidden variable revealed"
)| Farm | Labour (workers) | Soil Quality (hidden!) | Output (tonnes) |
|---|---|---|---|
| A | 10 | Very high | 102 |
| B | 8 | High | 89 |
| C | 6 | Medium | 79 |
| D | 4 | Low | 63 |
| E | 2 | Very low | 41 |
ggplot(df_cs_full, aes(x = Labour, y = Output, colour = Quality_label, label = Farm)) +
geom_point(aes(size = Quality), alpha = 0.85) +
geom_text(nudge_y = 1.8, fontface = "bold", size = 4.5, colour = "grey20") +
scale_colour_brewer(palette = "YlOrRd", direction = -1,
name = "Soil quality") +
scale_size_continuous(range = c(5, 12), guide = "none") +
labs(x = "Labour (workers)", y = "Output (tonnes)",
title = "Larger dots = better soil quality",
subtitle = "Better farms hire more workers AND produce more — a recipe for bias") +
theme_minimal(base_size = 13) +
theme(legend.position = "right")The OLS estimate of 7.4 tonnes per worker is upward biased because:
\[y_{it} = \alpha + \beta_{\text{OLS}}\, x_{it} + \varepsilon_{it}\]
…but the true model is:
\[y_{it} = \alpha_i + \beta_{\text{true}}\, x_{it} + u_{it}\]
where \(\alpha_i\) = soil quality (unobserved). Since better farms hire more workers, \(\text{Cov}(x_{it},\, \alpha_i) > 0\), so \(\hat\beta_{\text{OLS}} > \beta_{\text{true}}\).
Now we track the same farms over two years. Each farm changed its workforce between years.
set.seed(42)
# Labour levels in year 1 and year 2 (within-farm changes are small)
labour_y1 <- c(6, 5, 4, 3, 1)
labour_y2 <- labour_cs # same as cross-section year 2
output_y1 <- quality + true_beta * labour_y1 + rnorm(5, 0, 1)
output_y1 <- round(output_y1)
df_panel <- tibble(
Farm = rep(farms, 2),
Year = rep(c("Year 1", "Year 2"), each = 5),
Labour = c(labour_y1, labour_y2),
Output = c(output_y1, output_cs)
)
# Wide format for within-farm differences
df_wide <- df_panel |>
pivot_wider(names_from = Year, values_from = c(Labour, Output)) |>
mutate(
Delta_Labour = `Labour_Year 2` - `Labour_Year 1`,
Delta_Output = `Output_Year 2` - `Output_Year 1`,
Within_beta = round(Delta_Output / Delta_Labour, 2)
)kable(
df_panel |> arrange(Farm, Year),
col.names = c("Farm", "Year", "Labour (workers)", "Output (tonnes)"),
align = "cccc",
caption = "Table 3 — Panel data: same farms observed over two years"
)| Farm | Year | Labour (workers) | Output (tonnes) |
|---|---|---|---|
| A | Year 1 | 6 | 91 |
| A | Year 2 | 10 | 102 |
| B | Year 1 | 5 | 82 |
| B | Year 2 | 8 | 89 |
| C | Year 1 | 4 | 73 |
| C | Year 2 | 6 | 79 |
| D | Year 1 | 3 | 60 |
| D | Year 2 | 4 | 63 |
| E | Year 1 | 1 | 38 |
| E | Year 2 | 2 | 41 |
ggplot(df_panel, aes(x = Labour, y = Output,
colour = Farm, group = Farm, label = Farm)) +
geom_line(linewidth = 1.2, alpha = 0.8) +
geom_point(aes(shape = Year), size = 4) +
geom_text(data = df_panel |> filter(Year == "Year 2"),
nudge_x = 0.2, nudge_y = 0.8,
fontface = "bold", size = 4) +
scale_colour_brewer(palette = "Set2") +
scale_shape_manual(values = c("Year 1" = 16, "Year 2" = 17)) +
labs(x = "Labour (workers)", y = "Output (tonnes)",
title = "Within-farm variation: each line connects a farm's two observations",
colour = "Farm", shape = "Period") +
theme_minimal(base_size = 13)kable(
df_wide |>
select(Farm,
"Labour (Y1)" = `Labour_Year 1`,
"Labour (Y2)" = `Labour_Year 2`,
"Output (Y1)" = `Output_Year 1`,
"Output (Y2)" = `Output_Year 2`,
"ΔLabour" = Delta_Labour,
"ΔOutput" = Delta_Output,
"ΔOutput/ΔLabour" = Within_beta),
align = "cccccccc",
caption = "Table 4 — Within-farm changes between Year 1 and Year 2"
)| Farm | Labour (Y1) | Labour (Y2) | Output (Y1) | Output (Y2) | ΔLabour | ΔOutput | ΔOutput/ΔLabour |
|---|---|---|---|---|---|---|---|
| A | 6 | 10 | 91 | 102 | 4 | 11 | 2.75 |
| B | 5 | 8 | 82 | 89 | 3 | 7 | 2.33 |
| C | 4 | 6 | 73 | 79 | 2 | 6 | 3.00 |
| D | 3 | 4 | 60 | 63 | 1 | 3 | 3.00 |
| E | 1 | 2 | 38 | 41 | 1 | 3 | 3.00 |
Compare the within-farm slope (Table 4) to the cross-sectional OLS slope from Figure 1.
“The farm does not change its soil between years. Only its labour does.”
What you just discovered is exactly the problem identified by Yair Mundlak in 1961.”
Map the story onto the formal model:
| Story element | Notation |
|---|---|
| Farm output | \(y_{it}\) |
| Labour input | \(x_{it}\) |
| Soil quality / skill | \(\alpha_i\) |
| Year | \(t\) |
| Farm | \(i\) |
The key message:
“We were confusing differences between farms with changes within farms.”
Between-farm variation is contaminated by \(\alpha_i\). Within-farm variation over time is not — because the farm’s soil quality is the same in both years.
# Pooled OLS (ignores farm identity)
mod_ols <- lm(Output ~ Labour, data = df_panel)
# Fixed effects (within estimator via plm)
pdata <- pdata.frame(df_panel, index = c("Farm", "Year"))
mod_fe <- plm(Output ~ Labour, data = pdata, model = "within")
beta_ols <- round(coef(mod_ols)["Labour"], 3)
beta_fe <- round(coef(mod_fe)["Labour"], 3)tibble(
Estimator = c("Pooled OLS", "Fixed Effects (within)"),
`β̂ (Labour)` = c(beta_ols, beta_fe),
`True β` = c(true_beta, true_beta),
Bias = c(round(beta_ols - true_beta, 3),
round(beta_fe - true_beta, 3)),
`Variation used` = c("Between + within farms", "Within farms only")
) |>
kable(align = "lcccc",
caption = "Table 5 — Pooled OLS vs Fixed Effects")| Estimator | β̂ (Labour) | True β | Bias | Variation used |
|---|---|---|---|---|
| Pooled OLS | 7.291 | 2.5 | 4.791 | Between + within farms |
| Fixed Effects (within) | 2.677 | 2.5 | 0.177 | Within farms only |
# Fitted values from FE (add farm means back in for plotting)
df_panel <- df_panel |>
group_by(Farm) |>
mutate(
Labour_dm = Labour - mean(Labour),
Output_dm = Output - mean(Output),
Farm_mean_Labour = mean(Labour),
Farm_mean_Output = mean(Output)
) |>
ungroup()
ggplot(df_panel, aes(x = Labour, y = Output, colour = Farm, group = Farm)) +
# Farm-level parallel FE lines
geom_line(aes(y = Farm_mean_Output + beta_fe * (Labour - Farm_mean_Labour)),
linewidth = 1, linetype = "dashed", alpha = 0.7) +
# Observed points
geom_point(aes(shape = Year), size = 4) +
# Pooled OLS line
geom_abline(intercept = coef(mod_ols)[1], slope = coef(mod_ols)[2],
colour = "black", linewidth = 1.1, linetype = "solid") +
scale_colour_brewer(palette = "Set2") +
scale_shape_manual(values = c("Year 1" = 16, "Year 2" = 17)) +
annotate("text", x = 9, y = 76,
label = paste0("Pooled OLS: β̂ = ", beta_ols),
colour = "black", fontface = "bold", size = 3.8) +
annotate("text", x = 9, y = 72,
label = paste0("Fixed Effects: β̂ = ", beta_fe,
" (true β = ", true_beta, ")"),
colour = "grey30", fontface = "italic", size = 3.8) +
labs(x = "Labour (workers)", y = "Output (tonnes)",
title = "Solid line = Pooled OLS | Dashed lines = Farm-level FE fits",
colour = "Farm", shape = "Period") +
theme_minimal(base_size = 13)The solid black line is the pooled OLS regression line fitted on all 10 observations (5 farms × 2 years) combined, as if they were independent cross-sectional data points. It ignores farm identity entirely — every observation is treated as a separate, unrelated data point. The slope is therefore driven by both between-farm and within-farm variation mixed together. Because high-quality farms have both more labour and more output, the between-farm signal dominates and inflates the slope upward.
The dashed coloured lines, by contrast, are the farm-level FE fits — parallel lines with the within slope (\(\hat\beta_{\text{FE}}\)), one per farm, each anchored to that farm’s own mean. They use only within-farm variation over time, which is why their slope is much flatter and closer to the true \(\beta\).
The contrast between the steep black line and the shallow dashed lines is the visual payoff of the whole exercise: same data, very different estimates — because of what variation each estimator uses.
If we used the cross-sectional (OLS) estimate of 7.291 tonnes per worker:
The fixed effects estimate of 2.677 tonnes per worker — much closer to the true \(\beta = 2.5\) — gives the correct policy signal: labour has a real but much more modest effect on output.
“Fixed effects models do exactly what you just did informally: they hold each unit constant and look only at within-unit variation over time.”