Which Farmers Are More Productive?

Discovering Unobserved Heterogeneity Through Farm Data

Author

Dr Clemens Jarnach

Published

February 1, 2027

Load packages
library(tidyverse)
library(plm)
library(knitr)
library(RColorBrewer)

# Colour palette (Oxford blue + accents)
pal_farms <- brewer.pal(5, "Set2")

Step 1 — Set the Scene

You are economists hired by a Ministry of Agriculture.

You have been given data on five farms. Your task is to answer one policy question:

Does increasing the number of workers raise farm productivity (output)?

Keep it simple for now: more workers → more output … right?


Step 2 — The Cross-Sectional Evidence

Below is a snapshot of the five farms in a single year.

Generate cross-sectional data
set.seed(42)

# Unobserved farm quality (alpha_i) — higher-quality farms also hire more workers
quality <- c(75, 70, 63, 52, 35)        # unobserved: A is best, E is worst
farms   <- LETTERS[1:5]
labour_cs <- c(10, 8, 6, 4, 2)          # correlated with quality!

true_beta <- 2.5                          # true effect of one extra worker

# Output = quality + beta * labour + small noise
set.seed(42)
output_cs <- quality + true_beta * labour_cs + rnorm(5, 0, 1.5)
output_cs <- round(output_cs)

df_cs <- tibble(
  Farm   = farms,
  Labour = labour_cs,
  Output = output_cs
)
Display table
kable(
  df_cs,
  col.names = c("Farm", "Labour (workers)", "Output (tonnes)"),
  align = "ccc",
  caption = "Table 1 — Cross-sectional snapshot (Year 2)"
)
Table 1 — Cross-sectional snapshot (Year 2)
Farm Labour (workers) Output (tonnes)
A 10 102
B 8 89
C 6 79
D 4 63
E 2 41
Plot cross-sectional relationship
ols_cs <- lm(Output ~ Labour, data = df_cs)
beta_cs <- round(coef(ols_cs)["Labour"], 2)

ggplot(df_cs, aes(x = Labour, y = Output, colour = Farm, label = Farm)) +
  geom_smooth(aes(group = 1), method = "lm", se = TRUE,
              colour = "grey40", fill = "grey85", linewidth = 0.8) +
  geom_point(size = 5) +
  geom_text(nudge_y = 1.8, fontface = "bold", size = 4.5) +
  scale_colour_brewer(palette = "Set2") +
  annotate("text", x = 8, y = 68,
           label = paste0("OLS slope = ", beta_cs, " tonnes / worker"),
           colour = "grey30", size = 4, hjust = 0) +
  labs(x = "Labour (workers)", y = "Output (tonnes)",
       title = "Cross-sectional relationship: Labour → Output") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Figure 1 — Pooled OLS fit across farms (cross-section)
  • The slope is strongly positive: roughly 7.4 extra tonnes per additional worker.
  • A naïve reading suggests labour is highly productive.
  • Policy temptation: subsidise hiring to boost output.

But wait — should we trust this number?


Step 3 — Introducing Doubt

CautionThink before you answer

“Are all farms equally good, independent of how many workers they hire?”

Take 30 seconds to discuss with your neighbour. What might be driving the pattern in Figure 1?

Reveal: unobserved soil quality
df_cs_full <- df_cs |>
  mutate(Quality = quality,
         Quality_label = c("Very high", "High", "Medium", "Low", "Very low"))

kable(
  df_cs_full |> select(Farm, Labour, Quality_label, Output),
  col.names = c("Farm", "Labour (workers)", "Soil Quality (hidden!)", "Output (tonnes)"),
  align = "cccc",
  caption = "Table 2 — The hidden variable revealed"
)
Table 2 — The hidden variable revealed
Farm Labour (workers) Soil Quality (hidden!) Output (tonnes)
A 10 Very high 102
B 8 High 89
C 6 Medium 79
D 4 Low 63
E 2 Very low 41
Plot confounding structure
ggplot(df_cs_full, aes(x = Labour, y = Output, colour = Quality_label, label = Farm)) +
  geom_point(aes(size = Quality), alpha = 0.85) +
  geom_text(nudge_y = 1.8, fontface = "bold", size = 4.5, colour = "grey20") +
  scale_colour_brewer(palette = "YlOrRd", direction = -1,
                      name = "Soil quality") +
  scale_size_continuous(range = c(5, 12), guide = "none") +
  labs(x = "Labour (workers)", y = "Output (tonnes)",
       title = "Larger dots = better soil quality",
       subtitle = "Better farms hire more workers AND produce more — a recipe for bias") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "right")

Figure 2 — Labour is correlated with unobserved soil quality

The OLS estimate of 7.4 tonnes per worker is upward biased because:

\[y_{it} = \alpha + \beta_{\text{OLS}}\, x_{it} + \varepsilon_{it}\]

…but the true model is:

\[y_{it} = \alpha_i + \beta_{\text{true}}\, x_{it} + u_{it}\]

where \(\alpha_i\) = soil quality (unobserved). Since better farms hire more workers, \(\text{Cov}(x_{it},\, \alpha_i) > 0\), so \(\hat\beta_{\text{OLS}} > \beta_{\text{true}}\).


Step 4 — The Panel Data Reveal

Now we track the same farms over two years. Each farm changed its workforce between years.

Generate panel data (5 farms × 2 years)
set.seed(42)

# Labour levels in year 1 and year 2 (within-farm changes are small)
labour_y1 <- c(6, 5, 4, 3, 1)
labour_y2 <- labour_cs          # same as cross-section year 2

output_y1 <- quality + true_beta * labour_y1 + rnorm(5, 0, 1)
output_y1 <- round(output_y1)

df_panel <- tibble(
  Farm   = rep(farms, 2),
  Year   = rep(c("Year 1", "Year 2"), each = 5),
  Labour = c(labour_y1, labour_y2),
  Output = c(output_y1, output_cs)
)

# Wide format for within-farm differences
df_wide <- df_panel |>
  pivot_wider(names_from = Year, values_from = c(Labour, Output)) |>
  mutate(
    Delta_Labour = `Labour_Year 2` - `Labour_Year 1`,
    Delta_Output = `Output_Year 2` - `Output_Year 1`,
    Within_beta  = round(Delta_Output / Delta_Labour, 2)
  )
Display panel table
kable(
  df_panel |> arrange(Farm, Year),
  col.names = c("Farm", "Year", "Labour (workers)", "Output (tonnes)"),
  align = "cccc",
  caption = "Table 3 — Panel data: same farms observed over two years"
)
Table 3 — Panel data: same farms observed over two years
Farm Year Labour (workers) Output (tonnes)
A Year 1 6 91
A Year 2 10 102
B Year 1 5 82
B Year 2 8 89
C Year 1 4 73
C Year 2 6 79
D Year 1 3 60
D Year 2 4 63
E Year 1 1 38
E Year 2 2 41
Plot within-farm trends
ggplot(df_panel, aes(x = Labour, y = Output,
                     colour = Farm, group = Farm, label = Farm)) +
  geom_line(linewidth = 1.2, alpha = 0.8) +
  geom_point(aes(shape = Year), size = 4) +
  geom_text(data = df_panel |> filter(Year == "Year 2"),
            nudge_x = 0.2, nudge_y = 0.8,
            fontface = "bold", size = 4) +
  scale_colour_brewer(palette = "Set2") +
  scale_shape_manual(values = c("Year 1" = 16, "Year 2" = 17)) +
  labs(x = "Labour (workers)", y = "Output (tonnes)",
       title = "Within-farm variation: each line connects a farm's two observations",
       colour = "Farm", shape = "Period") +
  theme_minimal(base_size = 13)

Figure 3 — Within-farm changes: each line is one farm over time
Within-farm differences table
kable(
  df_wide |>
    select(Farm,
           "Labour (Y1)" = `Labour_Year 1`,
           "Labour (Y2)" = `Labour_Year 2`,
           "Output (Y1)" = `Output_Year 1`,
           "Output (Y2)" = `Output_Year 2`,
           "ΔLabour" = Delta_Labour,
           "ΔOutput" = Delta_Output,
           "ΔOutput/ΔLabour" = Within_beta),
  align = "cccccccc",
  caption = "Table 4 — Within-farm changes between Year 1 and Year 2"
)
Table 4 — Within-farm changes between Year 1 and Year 2
Farm Labour (Y1) Labour (Y2) Output (Y1) Output (Y2) ΔLabour ΔOutput ΔOutput/ΔLabour
A 6 10 91 102 4 11 2.75
B 5 8 82 89 3 7 2.33
C 4 6 73 79 2 6 3.00
D 3 4 60 63 1 3 3.00
E 1 2 38 41 1 3 3.00

Compare the within-farm slope (Table 4) to the cross-sectional OLS slope from Figure 1.

  • The within-farm estimates are much smaller — close to the true \(\beta = 2.5\).
  • The cross-sectional estimate of 7.4 was inflated because better farms (high \(\alpha_i\)) also hire more workers.
  • When we look within a farm across time, \(\alpha_i\) is held constant — quality does not change between Year 1 and Year 2 for the same farm.

“The farm does not change its soil between years. Only its labour does.”


Step 5 — The Insight: Mundlak (1961)

What you just discovered is exactly the problem identified by Yair Mundlak in 1961.”

Map the story onto the formal model:

Story element Notation
Farm output \(y_{it}\)
Labour input \(x_{it}\)
Soil quality / skill \(\alpha_i\)
Year \(t\)
Farm \(i\)

The key message:

“We were confusing differences between farms with changes within farms.”

Between-farm variation is contaminated by \(\alpha_i\). Within-farm variation over time is not — because the farm’s soil quality is the same in both years.


Step 6 — From Intuition to Fixed Effects

Run and compare OLS vs FE
# Pooled OLS (ignores farm identity)
mod_ols <- lm(Output ~ Labour, data = df_panel)

# Fixed effects (within estimator via plm)
pdata   <- pdata.frame(df_panel, index = c("Farm", "Year"))
mod_fe  <- plm(Output ~ Labour, data = pdata, model = "within")

beta_ols <- round(coef(mod_ols)["Labour"], 3)
beta_fe  <- round(coef(mod_fe)["Labour"], 3)
Display regression comparison
tibble(
  Estimator   = c("Pooled OLS", "Fixed Effects (within)"),
  `β̂ (Labour)` = c(beta_ols, beta_fe),
  `True β`    = c(true_beta, true_beta),
  Bias        = c(round(beta_ols - true_beta, 3),
                  round(beta_fe  - true_beta, 3)),
  `Variation used` = c("Between + within farms", "Within farms only")
) |>
  kable(align = "lcccc",
        caption = "Table 5 — Pooled OLS vs Fixed Effects")
Table 5 — Pooled OLS vs Fixed Effects
Estimator β̂ (Labour) True β Bias Variation used
Pooled OLS 7.291 2.5 4.791 Between + within farms
Fixed Effects (within) 2.677 2.5 0.177 Within farms only
Plot both regression fits
# Fitted values from FE (add farm means back in for plotting)
df_panel <- df_panel |>
  group_by(Farm) |>
  mutate(
    Labour_dm = Labour - mean(Labour),
    Output_dm = Output - mean(Output),
    Farm_mean_Labour = mean(Labour),
    Farm_mean_Output = mean(Output)
  ) |>
  ungroup()

ggplot(df_panel, aes(x = Labour, y = Output, colour = Farm, group = Farm)) +
  # Farm-level parallel FE lines
  geom_line(aes(y = Farm_mean_Output + beta_fe * (Labour - Farm_mean_Labour)),
            linewidth = 1, linetype = "dashed", alpha = 0.7) +
  # Observed points
  geom_point(aes(shape = Year), size = 4) +
  # Pooled OLS line
  geom_abline(intercept = coef(mod_ols)[1], slope = coef(mod_ols)[2],
              colour = "black", linewidth = 1.1, linetype = "solid") +
  scale_colour_brewer(palette = "Set2") +
  scale_shape_manual(values = c("Year 1" = 16, "Year 2" = 17)) +
  annotate("text", x = 9, y = 76,
           label = paste0("Pooled OLS: β̂ = ", beta_ols),
           colour = "black", fontface = "bold", size = 3.8) +
  annotate("text", x = 9, y = 72,
           label = paste0("Fixed Effects: β̂ = ", beta_fe,
                          " (true β = ", true_beta, ")"),
           colour = "grey30", fontface = "italic", size = 3.8) +
  labs(x = "Labour (workers)", y = "Output (tonnes)",
       title = "Solid line = Pooled OLS  |  Dashed lines = Farm-level FE fits",
       colour = "Farm", shape = "Period") +
  theme_minimal(base_size = 13)

Figure 4 — OLS vs within-farm (FE) estimates on the panel data
NoteReading the plot

The solid black line is the pooled OLS regression line fitted on all 10 observations (5 farms × 2 years) combined, as if they were independent cross-sectional data points. It ignores farm identity entirely — every observation is treated as a separate, unrelated data point. The slope is therefore driven by both between-farm and within-farm variation mixed together. Because high-quality farms have both more labour and more output, the between-farm signal dominates and inflates the slope upward.

The dashed coloured lines, by contrast, are the farm-level FE fits — parallel lines with the within slope (\(\hat\beta_{\text{FE}}\)), one per farm, each anchored to that farm’s own mean. They use only within-farm variation over time, which is why their slope is much flatter and closer to the true \(\beta\).

The contrast between the steep black line and the shallow dashed lines is the visual payoff of the whole exercise: same data, very different estimates — because of what variation each estimator uses.

If we used the cross-sectional (OLS) estimate of 7.291 tonnes per worker:

  • We would over-subsidise labour — expecting large output gains from hiring.
  • We would misattribute productivity to labour inputs, when it is really driven by underlying farm quality (\(\alpha_i\)).
  • The government would transfer resources to farms expecting gains that will not materialise.

The fixed effects estimate of 2.677 tonnes per worker — much closer to the true \(\beta = 2.5\) — gives the correct policy signal: labour has a real but much more modest effect on output.

“Fixed effects models do exactly what you just did informally: they hold each unit constant and look only at within-unit variation over time.”