# From Fixed Effects to Random Effects, and How to Choose

Load packages:

```r
library(tidyverse)
library(plm)
library(knitr)
library(RColorBrewer)

pal_farms <- brewer.pal(5, "Set2")
```
In the first exercise we tracked five farms over two years and found:
FE is powerful because it controls for all time-invariant unobserved heterogeneity — soil quality, management culture, location — without ever measuring it. But that power comes at a cost:

- FE can only use within-unit variation. Anything that does not change over time within a unit cannot be estimated.
- If you want to know whether organic certification (a fixed farm characteristic) affects productivity, FE cannot help — it has already “swept out” all between-farm differences.
This exercise asks: is there an alternative that uses more of the data and can still estimate time-invariant effects?
We now follow the same five farms across four years instead of two. Each farm changes its labour input modestly from year to year.
```r
set.seed(42)
true_beta <- 2.5
farms <- LETTERS[1:5]
years <- paste0("Year ", 1:4)

# Unobserved soil quality — correlated with labour (same as Exercise 1)
# Farm A is best, E is worst; better farms hire more workers
quality_a <- c(75, 70, 63, 52, 35)

# Labour matrix: rows = farms, columns = years
# Better farms (A, B) expand labour over time; worse farms stay small
labour_mat <- matrix(
  c(6, 7, 8, 10,   # Farm A
    5, 5, 6,  8,   # Farm B
    4, 5, 4,  6,   # Farm C
    3, 3, 4,  4,   # Farm D
    1, 2, 1,  2),  # Farm E
  nrow = 5, byrow = TRUE
)

# Output = quality + beta*labour + noise
# as.vector(labour_mat) goes column-by-column: all farms in Year 1, then Year 2, …
# matching rep(years, each = 5) and rep(farms, times = 4)
set.seed(42)
output_mat_a <- matrix(0, nrow = 5, ncol = 4)
for (t in 1:4) {
  output_mat_a[, t] <- quality_a + true_beta * labour_mat[, t] + rnorm(5, 0, 1.5)
}
output_mat_a <- round(output_mat_a)

df_panel <- tibble(
  Farm   = rep(farms, times = 4),
  Year   = rep(years, each = 5),
  t      = rep(1:4, each = 5),
  Labour = as.vector(labour_mat),
  Output = as.vector(output_mat_a)
)
```

```r
kable(
  df_panel |> select(Farm, Year, Labour, Output),
  col.names = c("Farm", "Year", "Labour (workers)", "Output (tonnes)"),
  align = "cccc",
  caption = "Table 1 — Extended panel: 5 farms × 4 years = 20 observations"
)
```

| Farm | Year | Labour (workers) | Output (tonnes) |
|---|---|---|---|
| A | Year 1 | 6 | 92 |
| B | Year 1 | 5 | 82 |
| C | Year 1 | 4 | 74 |
| D | Year 1 | 3 | 60 |
| E | Year 1 | 1 | 38 |
| A | Year 2 | 7 | 92 |
| B | Year 2 | 5 | 85 |
| C | Year 2 | 5 | 75 |
| D | Year 2 | 3 | 63 |
| E | Year 2 | 2 | 40 |
| A | Year 3 | 8 | 97 |
| B | Year 3 | 6 | 88 |
| C | Year 3 | 4 | 71 |
| D | Year 3 | 4 | 62 |
| E | Year 3 | 1 | 37 |
| A | Year 4 | 10 | 101 |
| B | Year 4 | 8 | 90 |
| C | Year 4 | 6 | 74 |
| D | Year 4 | 4 | 58 |
| E | Year 4 | 2 | 42 |
```r
ggplot(df_panel, aes(x = t, y = Output, colour = Farm, group = Farm)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:4, labels = years) +
  scale_colour_brewer(palette = "Set2") +
  labs(x = NULL, y = "Output (tonnes)",
       title = "Output trajectories — farms are persistently ordered by soil quality") +
  theme_minimal(base_size = 13)
```

Farm A always leads, Farm E always trails. This persistent ordering reflects unobserved soil quality (\(\alpha_i\)), which is fixed across time. FE controls for this perfectly — but by doing so it also discards all the between-farm information.
The Random Effects (RE) model offers a middle path. Instead of eliminating \(\alpha_i\) by differencing it away, it treats \(\alpha_i\) as a random variable drawn from a distribution:
\[y_{it} = \mu + \beta\, x_{it} + \underbrace{\alpha_i + u_{it}}_{\varepsilon_{it}}\]
where:
| Symbol | Meaning |
|---|---|
| \(\mu\) | Grand intercept |
| \(\beta\) | Effect of labour (what we want) |
| \(\alpha_i\) | Farm-level random intercept (unobserved heterogeneity) |
| \(u_{it}\) | Idiosyncratic noise |
The composite error \(\varepsilon_{it} = \alpha_i + u_{it}\) has a block structure: observations from the same farm are correlated because they share \(\alpha_i\). RE uses Generalised Least Squares (GLS) to exploit this structure, weighting within and between variation optimally.
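Concretely, the block structure means any two observations from the same farm carry a constant error correlation, the intra-class correlation that GLS exploits (writing \(\sigma_\alpha^2 = \text{Var}(\alpha_i)\) and \(\sigma_u^2 = \text{Var}(u_{it})\)):

\[\text{Corr}(\varepsilon_{it},\, \varepsilon_{is}) = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2}, \qquad t \neq s\]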
RE requires:
\[\text{Cov}(\alpha_i,\, x_{it}) = 0\]
The unobserved farm-level characteristic must be uncorrelated with the predictor.
If this fails — as it did in Exercise 1, where better farms (high \(\alpha_i\)) also hired more workers (high \(x_{it}\)) — RE inherits the same omitted-variable bias as OLS. FE, by contrast, is consistent regardless.
One practical reason to prefer RE over FE: FE cannot estimate the effect of any variable that does not change within a unit over time.
Suppose the Ministry also collected data on each farm’s ownership type — whether the farm is family-owned (0) or a cooperative (1). This is fixed across the four years.
```r
# Farms A, C, E are family-owned; B and D are cooperatives
ownership <- c(0, 1, 0, 1, 0)
names(ownership) <- farms

# df_panel_b is the Scenario B panel (randomised labour assignment)
# constructed earlier; ownership is merged in by farm name
df_panel_b <- df_panel_b |>
  mutate(Ownership = ownership[Farm])

pdata_b_own <- pdata.frame(df_panel_b, index = c("Farm", "Year"))

# FE: ownership is collinear with farm dummies — dropped
mod_fe_own <- plm(Output ~ Labour + Ownership,
                  data = pdata_b_own, model = "within")

# RE: can estimate ownership effect
mod_re_own <- plm(Output ~ Labour + Ownership,
                  data = pdata_b_own, model = "random")
```

```r
tibble(
  Parameter = c("Labour (β₁)", "Ownership — cooperative (β₂)"),
  `FE estimate` = c(round(coef(mod_fe_own)["Labour"], 3),
                    "— (absorbed by farm FE)"),
  `RE estimate` = c(round(coef(mod_re_own)["Labour"], 3),
                    round(coef(mod_re_own)["Ownership"], 3))
) |>
  kable(align = "lcc",
        caption = "Table 4 — FE cannot identify time-invariant predictors; RE can")
```

| Parameter | FE estimate | RE estimate |
|---|---|---|
| Labour (β₁) | 2.949 | 2.941 |
| Ownership — cooperative (β₂) | — (absorbed by farm FE) | 9.213 |
Fixed effects can be estimated in several ways — including the within-estimator (demeaning), first-differencing, or explicitly including unit dummy variables — but they all produce the same fundamental result: only within-unit variation is used for identification.
The most common approach, within-demeaning, subtracts each unit’s time-mean from every observation:
\[\tilde{y}_{it} = y_{it} - \bar{y}_i, \quad \tilde{x}_{it} = x_{it} - \bar{x}_i\]
A time-invariant variable like ownership has \(x_{it} = x_i\) for all \(t\), so after demeaning: \(\tilde{x}_{it} = x_i - x_i = 0\). The variable is literally zeroed out — it provides no variation to estimate from. The same logic applies regardless of which estimation approach is used: if a variable never changes within a unit, no within-unit estimator can identify its effect.
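A minimal base-R sketch makes the zeroing-out concrete (toy numbers, not the farms data above):

```r
# Toy panel: 2 farms x 3 years
farm      <- rep(c("A", "B"), each = 3)
labour    <- c(6, 7, 8, 1, 2, 1)   # varies within each farm
ownership <- c(0, 0, 0, 1, 1, 1)   # fixed within each farm

# Within-demeaning: subtract each farm's time-mean (ave() computes group means)
labour_tilde    <- labour    - ave(labour, farm)
ownership_tilde <- ownership - ave(ownership, farm)

all(labour_tilde == 0)     # FALSE: within variation survives
all(ownership_tilde == 0)  # TRUE:  ownership is zeroed out entirely
```

Any regression on the demeaned data can therefore use `labour_tilde` but gets literally nothing from `ownership_tilde`.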
RE applies only a partial transformation using the estimated intra-class correlation, leaving some between-unit signal intact. That between-unit signal identifies \(\beta_2\).
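That partial transformation can be written explicitly (a standard result; \(\sigma_\alpha^2 = \text{Var}(\alpha_i)\), \(\sigma_u^2 = \text{Var}(u_{it})\), \(T\) = number of periods):

\[y_{it} - \theta\,\bar{y}_i, \qquad \theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\,\sigma_\alpha^2}}\]

Because \(\theta < 1\) whenever \(\sigma_u^2 > 0\), a time-invariant regressor \(x_i\) transforms to \((1 - \theta)\,x_i \neq 0\): exactly the surviving between-unit signal that identifies \(\beta_2\).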
This is a key practical reason RE is preferred in research designs that include important time-invariant covariates (e.g., country-level institutions, individual characteristics such as gender or ethnicity, historical conditions).
If you are uncertain whether \(\text{Cov}(\alpha_i, x_{it}) = 0\), the Hausman test gives a formal answer.
Logic: the test statistic compares the FE and RE coefficient vectors. A large difference (relative to the sampling variance) rejects \(H_0\) and points to FE.
\[H = (\hat\beta_{\text{FE}} - \hat\beta_{\text{RE}})^\top \left[\text{Var}(\hat\beta_{\text{FE}}) - \text{Var}(\hat\beta_{\text{RE}})\right]^{-1} (\hat\beta_{\text{FE}} - \hat\beta_{\text{RE}}) \sim \chi^2(k)\]
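With a single coefficient the statistic collapses to a scalar, which is easy to compute by hand. The sketch below uses the Scenario A point estimates from Table 6; the two variance figures are purely hypothetical, chosen only to illustrate the arithmetic:

```r
# Scalar Hausman statistic for one coefficient
b_fe <- 2.000   # FE estimate of beta (Scenario A, Table 6)
b_re <- 4.065   # RE estimate of beta (Scenario A, Table 6)
v_fe <- 0.70    # Var(b_fe) — hypothetical, for illustration
v_re <- 0.18    # Var(b_re) — hypothetical (RE is more efficient, so smaller)

H <- (b_fe - b_re)^2 / (v_fe - v_re)        # ~ chi-squared with 1 df under H0
p <- pchisq(H, df = 1, lower.tail = FALSE)
c(H = H, p = p)
```

A large \(H\) (small \(p\)) says the two estimators disagree by more than sampling noise allows, so the RE assumption is suspect.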
```r
# mod_fe_a/mod_re_a and mod_fe_b/mod_re_b are the FE and RE fits
# for Scenarios A and B estimated earlier
ht_a <- phtest(mod_fe_a, mod_re_a)  # Scenario A: expect rejection
ht_b <- phtest(mod_fe_b, mod_re_b)  # Scenario B: expect no rejection

tibble(
  Scenario = c("A — Cov(αᵢ, xᵢₜ) ≠ 0 (quality correlated with labour)",
               "B — Cov(αᵢ, xᵢₜ) ≈ 0 (randomised labour assignment)"),
  `χ² statistic` = c(round(ht_a$statistic, 3), round(ht_b$statistic, 3)),
  df = c(ht_a$parameter, ht_b$parameter),
  `p-value` = c(round(ht_a$p.value, 4), round(ht_b$p.value, 4)),
  Decision = c("Reject H₀ → use FE",
               "Fail to reject H₀ → RE preferred")
) |>
  kable(align = "lcccc",
        caption = "Table 5 — Hausman test results for both scenarios")
```

| Scenario | χ² statistic | df | p-value | Decision |
|---|---|---|---|---|
| A — Cov(αᵢ, xᵢₜ) ≠ 0 (quality correlated with labour) | 8.137 | 1 | 0.0043 | Reject H₀ → use FE |
| B — Cov(αᵢ, xᵢₜ) ≈ 0 (randomised labour assignment) | 0.010 | 1 | 0.9202 | Fail to reject H₀ → RE preferred |
The Hausman test has real limitations in practice:
- Low power in short panels. With few time periods or few units, the test may fail to detect genuine correlation. A non-rejection does not guarantee RE is safe.
- It only tests one direction. The test checks whether RE is as consistent as FE. But if FE itself is the wrong model (e.g., non-linear effects, wrong functional form), the test can give misleading guidance.
- Practical knowledge matters. If your theory or prior work strongly suggests that unobserved heterogeneity correlates with your predictor, prefer FE regardless of the test result.
Use the Hausman test as one input, not the final word.
```r
# beta_ols_a, beta_fe_a, beta_re_a and their Scenario B counterparts are the
# coefficient estimates extracted from the earlier model fits
estimates_long <- tibble(
  Estimator = rep(c("Pooled OLS", "Fixed Effects", "Random Effects"), 2),
  Scenario  = rep(c("A: Cov(αᵢ, xᵢₜ) ≠ 0\n(FE needed)",
                    "B: Cov(αᵢ, xᵢₜ) ≈ 0\n(RE preferred)"), each = 3),
  Beta = c(beta_ols_a, beta_fe_a, beta_re_a,
           beta_ols_b, beta_fe_b, beta_re_b)
) |>
  mutate(
    Estimator = factor(Estimator,
                       levels = c("Pooled OLS", "Random Effects", "Fixed Effects"))
  )

ggplot(estimates_long,
       aes(x = Beta, y = Estimator, colour = Estimator, shape = Estimator)) +
  geom_vline(xintercept = true_beta, linetype = "dashed",
             colour = "grey40", linewidth = 0.8) +
  geom_point(size = 5) +
  annotate("text", x = true_beta + 0.05, y = 0.55,
           label = paste0("True β = ", true_beta),
           colour = "grey40", size = 3.5, hjust = 0) +
  scale_colour_manual(values = c("Pooled OLS" = "#666666",
                                 "Fixed Effects" = "#4DAF4A",
                                 "Random Effects" = "#E41A1C")) +
  scale_shape_manual(values = c("Pooled OLS" = 15,
                                "Fixed Effects" = 17,
                                "Random Effects" = 16)) +
  facet_wrap(~ Scenario, ncol = 2) +
  labs(x = "Estimated β (Labour → Output)", y = NULL,
       title = "Proximity to the true β across scenarios and estimators") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom", legend.title = element_blank())
```

```r
tibble(
  Estimator   = c("Pooled OLS", "Fixed Effects (within)", "Random Effects (GLS)"),
  `Sc A β̂`   = c(beta_ols_a, beta_fe_a, beta_re_a),
  `Sc A Bias` = c(round(beta_ols_a - true_beta, 2),
                  round(beta_fe_a - true_beta, 2),
                  round(beta_re_a - true_beta, 2)),
  `Sc B β̂`   = c(beta_ols_b, beta_fe_b, beta_re_b),
  `Sc B Bias` = c(round(beta_ols_b - true_beta, 2),
                  round(beta_fe_b - true_beta, 2),
                  round(beta_re_b - true_beta, 2))
) |>
  kable(align = "lcccc",
        col.names = c("Estimator",
                      "Sc A: β̂", "Sc A: Bias",
                      "Sc B: β̂", "Sc B: Bias"),
        caption = paste0("Table 6 — All estimators: true β = ", true_beta))
```

| Estimator | Sc A: β̂ | Sc A: Bias | Sc B: β̂ | Sc B: Bias |
|---|---|---|---|---|
| Pooled OLS | 7.925 | 5.42 | 2.461 | -0.04 |
| Fixed Effects (within) | 2.000 | -0.50 | 2.949 | 0.45 |
| Random Effects (GLS) | 4.065 | 1.57 | 2.940 | 0.44 |
Use the table below as a starting point when choosing between FE and RE in practice.
| Question | FE | RE |
|---|---|---|
| Is \(\text{Cov}(\alpha_i, x_{it}) = 0\) plausible? | Not required | Required |
| Can I estimate time-invariant predictors? | No | Yes |
| Can I estimate effects of variables with little within-unit variation? | Poorly | Better |
| Is efficiency a concern (small \(T\), large \(N\))? | Less efficient | More efficient |
| Hausman test rejects \(H_0\)? | Preferred | Inconsistent |
| Research design uses random assignment? | Either works | RE preferred |
| Observational data with likely self-selection? | Preferred | Only if assumption is defensible |
> “Start with FE. It is the conservative, credible choice. Use RE only when you have a strong theoretical or empirical reason to believe that unobserved heterogeneity is uncorrelated with your regressors — and even then, always report the Hausman test and a FE robustness check side by side.”
In this exercise you discovered:
- Fixed Effects eliminates unobserved heterogeneity by exploiting within-unit variation over time — but at the cost of discarding between-unit information and losing all time-invariant predictors.
- Random Effects treats heterogeneity as part of the error and uses GLS to efficiently combine within and between variation — but requires the strong assumption \(\text{Cov}(\alpha_i, x_{it}) = 0\).
- The Hausman test is the standard diagnostic for choosing between FE and RE. A significant result favours FE; a non-significant result makes RE defensible, though not guaranteed safe.
- When the RE assumption holds, RE dominates: unbiased, efficient, and able to estimate time-invariant predictors that FE cannot touch.
- In practice, begin with FE for credibility, and treat RE as a robustness check or primary specification only when the research design (e.g., random assignment) justifies it.
A powerful synthesis due to Mundlak (1978): include the within-unit means of all time-varying regressors as additional controls in the RE model:
\[y_{it} = \mu + \beta\, x_{it} + \gamma\, \bar{x}_i + \alpha_i^* + u_{it}\]
This “Mundlak device” soaks up the correlation between \(\alpha_i\) and \(x_{it}\), making \(\hat\beta\) numerically equivalent to the FE estimator — while the RE framework still allows time-invariant variables to enter through \(\bar{x}_i\). It is increasingly recommended as a best-of-both-worlds specification, and provides a further diagnostic: if \(\hat\gamma \neq 0\), the RE assumption was violated.
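A minimal sketch of the device on a toy panel (illustrative numbers, not the farms data above). OLS is used here for transparency; the same numerical equivalence holds for the RE-GLS fit in a balanced panel:

```r
set.seed(7)

# Toy balanced panel: 5 farms x 4 years, quality correlated with labour
dd <- expand.grid(Farm = LETTERS[1:5], Year = 1:4)
dd$quality <- c(75, 70, 63, 52, 35)[as.integer(dd$Farm)]           # unobserved alpha_i
dd$Labour  <- c(6, 5, 4, 3, 1)[as.integer(dd$Farm)] + dd$Year - 1  # correlated with alpha_i
dd$Output  <- dd$quality + 2.5 * dd$Labour + rnorm(nrow(dd), 0, 1.5)

dd$Labour_mean <- ave(dd$Labour, dd$Farm)   # Mundlak term: within-farm mean of Labour

# FE (within) estimate: demeaned Output on demeaned Labour, no intercept
b_within <- coef(lm(I(Output - ave(Output, Farm)) ~ 0 + I(Labour - ave(Labour, Farm)),
                    data = dd))[[1]]

# Mundlak regression: the Labour coefficient reproduces the within estimate exactly
b_mundlak <- coef(lm(Output ~ Labour + Labour_mean, data = dd))[["Labour"]]

c(within = b_within, mundlak = b_mundlak)
```

The coefficient on `Labour_mean` plays the role of \(\hat\gamma\): a value far from zero signals that unit means predict the outcome beyond the within variation, i.e. the RE assumption is violated.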