Fixed or Random — Which Model Fits?

From Fixed Effects to Random Effects, and How to Choose

Author

Dr Clemens Jarnach

Published

February 1, 2027

Load packages
library(tidyverse)
library(plm)
library(knitr)
library(RColorBrewer)

pal_farms <- brewer.pal(5, "Set2")

Recap — What We Learned in Exercise 1

In the first exercise we tracked five farms over two years and found:

  • Pooled OLS was badly biased — it confused high-quality farms with productive farms.
  • Fixed Effects (FE) eliminated that bias by using within-farm variation only: holding each farm constant and asking “when this farm hired more workers, did it produce more?”
  • The FE estimate (\(\hat\beta_{\text{FE}} \approx 2.5\)) was close to the true value; OLS was far above it.
Note — The FE trade-off

FE is powerful because it controls for all time-invariant unobserved heterogeneity — soil quality, management culture, location — without ever measuring it. But that power comes at a cost:

FE can only use within-unit variation. Anything that does not change over time within a unit cannot be estimated.

If you want to know whether organic certification (a fixed farm characteristic) affects productivity, FE cannot help — it has already “swept out” all between-farm differences.

This exercise asks: is there an alternative that uses more of the data and can still estimate time-invariant effects?


Step 1 — Extending the Panel to Four Years

We now follow the same five farms across four years instead of two. Each farm changes its labour input modestly from year to year.

Generate 4-year panel data
set.seed(42)
true_beta <- 2.5
farms     <- LETTERS[1:5]
years     <- paste0("Year ", 1:4)

# Unobserved soil quality — correlated with labour (same as Exercise 1)
# Farm A is best, E is worst; better farms hire more workers
quality_a <- c(75, 70, 63, 52, 35)

# Labour matrix: rows = farms, columns = years
# Better farms (A, B) expand labour over time; worse farms stay small
labour_mat <- matrix(
  c( 6,  7,  8, 10,   # Farm A
     5,  5,  6,  8,   # Farm B
     4,  5,  4,  6,   # Farm C
     3,  3,  4,  4,   # Farm D
     1,  2,  1,  2),  # Farm E
  nrow = 5, byrow = TRUE
)

# Output = quality + beta*labour + noise
# as.vector(labour_mat) goes column-by-column: all farms in Year 1, then Year 2, …
# matching rep(years, each=5) and rep(farms, times=4)
set.seed(42)
output_mat_a <- matrix(0, nrow = 5, ncol = 4)
for (t in 1:4) {
  output_mat_a[, t] <- quality_a + true_beta * labour_mat[, t] + rnorm(5, 0, 1.5)
}
output_mat_a <- round(output_mat_a)

df_panel <- tibble(
  Farm   = rep(farms, times = 4),
  Year   = rep(years, each  = 5),
  t      = rep(1:4,   each  = 5),
  Labour = as.vector(labour_mat),
  Output = as.vector(output_mat_a)
)
Display panel
kable(
  df_panel |> select(Farm, Year, Labour, Output),
  col.names = c("Farm", "Year", "Labour (workers)", "Output (tonnes)"),
  align = "cccc",
  caption = "Table 1 — Extended panel: 5 farms × 4 years = 20 observations"
)
Table 1 — Extended panel: 5 farms × 4 years = 20 observations
Farm Year Labour (workers) Output (tonnes)
A Year 1 6 92
B Year 1 5 82
C Year 1 4 74
D Year 1 3 60
E Year 1 1 38
A Year 2 7 92
B Year 2 5 85
C Year 2 5 75
D Year 2 3 63
E Year 2 2 40
A Year 3 8 97
B Year 3 6 88
C Year 3 4 71
D Year 3 4 62
E Year 3 1 37
A Year 4 10 101
B Year 4 8 90
C Year 4 6 74
D Year 4 4 58
E Year 4 2 42
Plot panel trajectories
ggplot(df_panel, aes(x = t, y = Output, colour = Farm, group = Farm)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:4, labels = years) +
  scale_colour_brewer(palette = "Set2") +
  labs(x = NULL, y = "Output (tonnes)",
       title = "Output trajectories — farms are persistently ordered by soil quality") +
  theme_minimal(base_size = 13)

Figure 1 — Farm output trajectories over four years

Farm A always leads, Farm E always trails. This persistent ordering reflects unobserved soil quality (\(\alpha_i\)), which is fixed across time. FE controls for this perfectly — but by doing so it also discards all the between-farm information.


Step 2 — Introducing the Random Effects Model

The Random Effects (RE) model offers a middle path. Instead of eliminating \(\alpha_i\) by differencing it away, it treats \(\alpha_i\) as a random variable drawn from a distribution:

\[y_{it} = \mu + \beta\, x_{it} + \underbrace{\alpha_i + u_{it}}_{\varepsilon_{it}}\]

where:

Symbol Meaning
\(\mu\) Grand intercept
\(\beta\) Effect of labour (what we want)
\(\alpha_i\) Farm-level random intercept (unobserved heterogeneity)
\(u_{it}\) Idiosyncratic noise

The composite error \(\varepsilon_{it} = \alpha_i + u_{it}\) has a block structure: observations from the same farm are correlated because they share \(\alpha_i\). RE uses Generalised Least Squares (GLS) to exploit this structure, weighting within and between variation optimally.
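Concretely, the GLS step amounts to quasi-demeaning: each variable has a fraction \(\theta\) of its unit mean subtracted before OLS is run. A minimal base-R sketch of the weight (the variance values below are illustrative, not estimated from our data):

```r
# RE/GLS regresses (y_it - theta*ybar_i) on (x_it - theta*xbar_i), where
# theta depends on the variance components and the panel length Tn
theta_re <- function(sigma_u, sigma_alpha, Tn) {
  1 - sqrt(sigma_u^2 / (sigma_u^2 + Tn * sigma_alpha^2))
}
theta_re(sigma_u = 1.5, sigma_alpha = 0,  Tn = 4)   # 0  -> RE collapses to pooled OLS
theta_re(sigma_u = 1.5, sigma_alpha = 15, Tn = 4)   # ~1 -> RE approaches FE
```

\(\theta = 0\) (no unit-level variance) reproduces pooled OLS, while \(\theta \to 1\) (variance dominated by \(\alpha_i\)) reproduces FE; plm estimates \(\theta\) from the data, so RE sits between the two extremes.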

Important — The critical assumption

RE requires:

\[\text{Cov}(\alpha_i,\, x_{it}) = 0\]

The unobserved farm-level characteristic must be uncorrelated with the predictor.

If this fails — as it did in Exercise 1, where better farms (high \(\alpha_i\)) also hired more workers (high \(x_{it}\)) — RE inherits the same omitted-variable bias as OLS. FE, by contrast, is consistent regardless.


Step 3 — Scenario A: When RE Fails (Correlated Heterogeneity)

This is our original farm data from Exercise 1, now extended to four years. Soil quality is strongly correlated with labour: better farms (high \(\alpha_i\)) consistently hire more workers.

Fit OLS, FE, and RE on correlated data
pdata_a <- pdata.frame(df_panel, index = c("Farm", "Year"))

mod_ols_a <- lm(Output ~ Labour, data = df_panel)
mod_fe_a  <- plm(Output ~ Labour, data = pdata_a, model = "within")
mod_re_a  <- plm(Output ~ Labour, data = pdata_a, model = "random")

beta_ols_a <- round(coef(mod_ols_a)["Labour"], 3)
beta_fe_a  <- round(coef(mod_fe_a)["Labour"],  3)
beta_re_a  <- round(coef(mod_re_a)["Labour"],  3)
Display comparison table
tibble(
  Estimator        = c("Pooled OLS", "Fixed Effects", "Random Effects"),
  `β̂ (Labour)`    = c(beta_ols_a, beta_fe_a, beta_re_a),
  `True β`         = true_beta,
  Bias             = c(round(beta_ols_a - true_beta, 3),
                       round(beta_fe_a  - true_beta, 3),
                       round(beta_re_a  - true_beta, 3)),
  `Variation used` = c("Between + within",
                       "Within only",
                       "Within + between (weighted)")
) |>
  kable(align = "lcccc",
        caption = "Table 2 — Scenario A: heterogeneity correlated with labour")
Table 2 — Scenario A: heterogeneity correlated with labour
Estimator β̂ (Labour) True β Bias Variation used
Pooled OLS 7.925 2.5 5.425 Between + within
Fixed Effects 2.000 2.5 -0.500 Within only
Random Effects 4.065 2.5 1.565 Within + between (weighted)
Plot all three fits
df_panel_a <- df_panel |>
  group_by(Farm) |>
  mutate(Farm_mean_L = mean(Labour), Farm_mean_O = mean(Output)) |>
  ungroup()

ggplot(df_panel_a, aes(x = Labour, y = Output, colour = Farm, group = Farm)) +
  # FE: parallel within-slope lines, one per farm
  geom_line(aes(y = Farm_mean_O + beta_fe_a * (Labour - Farm_mean_L)),
            linewidth = 1, linetype = "dashed", alpha = 0.7) +
  geom_point(size = 3.5, alpha = 0.85) +
  # Pooled OLS
  geom_abline(intercept = coef(mod_ols_a)[1], slope = beta_ols_a,
              colour = "black", linewidth = 1.1) +
  # RE (single-slope GLS line)
  geom_abline(intercept = coef(mod_re_a)[1], slope = beta_re_a,
              colour = "#E41A1C", linewidth = 1.1, linetype = "dotdash") +
  scale_colour_brewer(palette = "Set2") +
  annotate("text", x = 9.2, y = 100,
           label = paste0("OLS: β̂ = ", beta_ols_a),
           colour = "black", fontface = "bold", size = 3.6) +
  annotate("text", x = 9.2, y = 96,
           label = paste0("RE:  β̂ = ", beta_re_a),
           colour = "#E41A1C", fontface = "bold", size = 3.6) +
  annotate("text", x = 9.2, y = 92,
           label = paste0("FE:  β̂ = ", beta_fe_a,
                          "  ← closest to truth (", true_beta, ")"),
           colour = "grey30", fontface = "italic", size = 3.6) +
  labs(x = "Labour (workers)", y = "Output (tonnes)",
       title = "Scenario A — RE is biased when Cov(αᵢ, xᵢₜ) ≠ 0",
       subtitle = "Solid = OLS  |  Dot-dash = RE  |  Dashed lines = FE (per farm)",
       colour = "Farm") +
  theme_minimal(base_size = 13)

Figure 2 — Scenario A: OLS, FE, and RE fits on correlated data

In Scenario A:

  • OLS is the most biased (\(\hat\beta_{\text{OLS}} \approx\) 7.925) — it ignores farm identity entirely and conflates quality differences with labour effects.
  • RE is better than OLS but still biased (\(\hat\beta_{\text{RE}} \approx\) 4.065): it borrows some of the contaminated between-farm signal because it cannot distinguish “this farm is productive because it has good soil” from “this farm is productive because it has more workers.”
  • FE is closest to the truth (\(\hat\beta_{\text{FE}} \approx\) 2) because it discards between-farm variation and uses only within-farm changes over time.

The RE estimate sits between OLS and FE because RE is a weighted average of within and between estimators. When the between estimator is biased (as here), RE is partially contaminated.
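The between estimator itself can be computed by hand from the farm averages in Table 1, which makes the contamination visible (a base-R sketch; the means below are the four-year averages of each farm's Labour and Output columns):

```r
# Between estimator: regress farm mean output on farm mean labour
# (farm-level averages computed from Table 1)
mean_L <- c(A = 7.75, B = 6.00, C = 4.75, D = 3.50, E = 1.50)
mean_O <- c(A = 95.50, B = 86.25, C = 73.50, D = 60.75, E = 39.25)
coef(lm(mean_O ~ mean_L))["mean_L"]   # ~9.2: far above the true beta of 2.5
```

With the between estimator near 9.2 and the within (FE) estimator at 2.0, the RE estimate of 4.065 is exactly the kind of weighted compromise described above.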


Step 4 — Scenario B: When RE Succeeds (Uncorrelated Heterogeneity)

Now imagine a policy experiment: the Ministry of Agriculture conducted a randomised trial, randomly assigning additional workers to farms. The randomisation breaks the link between soil quality and labour: a farm’s soil quality no longer predicts how many workers it has.

In this scenario, soil quality for farms A–E is 55, 50, 60, 75, 45. Notice that Farm D — which has the fewest workers on average — now has the best soil. The correlation between quality and mean labour is near zero.

Generate and fit data with uncorrelated heterogeneity
# Quality uncorrelated with mean labour: constructed to be orthogonal
# Farm D (low labour) now has the best soil; Farm A (high labour) does not
quality_b <- c(55, 50, 60, 75, 45)

cat("Correlation of quality with farm mean labour:\n")
cat("  Scenario A:", round(cor(quality_a, rowMeans(labour_mat)), 3), "\n")
cat("  Scenario B:", round(cor(quality_b, rowMeans(labour_mat)), 3), "\n")

Correlation of quality with farm mean labour:
  Scenario A: 0.977 
  Scenario B: 0.016 
set.seed(77)
output_mat_b <- matrix(0, nrow = 5, ncol = 4)
for (t in 1:4) {
  output_mat_b[, t] <- quality_b + true_beta * labour_mat[, t] + rnorm(5, 0, 1.5)
}
output_mat_b <- round(output_mat_b)

df_panel_b <- tibble(
  Farm   = rep(farms, times = 4),
  Year   = rep(years, each  = 5),
  t      = rep(1:4,   each  = 5),
  Labour = as.vector(labour_mat),
  Output = as.vector(output_mat_b)
)

pdata_b <- pdata.frame(df_panel_b, index = c("Farm", "Year"))

mod_ols_b <- lm(Output ~ Labour, data = df_panel_b)
mod_fe_b  <- plm(Output ~ Labour, data = pdata_b, model = "within")
mod_re_b  <- plm(Output ~ Labour, data = pdata_b, model = "random")

beta_ols_b <- round(coef(mod_ols_b)["Labour"], 3)
beta_fe_b  <- round(coef(mod_fe_b)["Labour"],  3)
beta_re_b  <- round(coef(mod_re_b)["Labour"],  3)
Display comparison table — Scenario B
tibble(
  Estimator        = c("Pooled OLS", "Fixed Effects", "Random Effects"),
  `β̂ (Labour)`    = c(beta_ols_b, beta_fe_b, beta_re_b),
  `True β`         = true_beta,
  Bias             = c(round(beta_ols_b - true_beta, 3),
                       round(beta_fe_b  - true_beta, 3),
                       round(beta_re_b  - true_beta, 3)),
  `SE advantage`   = c("—",
                       "Less efficient (within only)",
                       "More efficient (within + between)")
) |>
  kable(align = "lcccc",
        caption = "Table 3 — Scenario B: heterogeneity uncorrelated with labour")
Table 3 — Scenario B: heterogeneity uncorrelated with labour
Estimator β̂ (Labour) True β Bias SE advantage
Pooled OLS 2.461 2.5 -0.039 —
Fixed Effects 2.949 2.5 0.449 Less efficient (within only)
Random Effects 2.940 2.5 0.440 More efficient (within + between)
Plot Scenario B
df_panel_b2 <- df_panel_b |>
  group_by(Farm) |>
  mutate(Farm_mean_L = mean(Labour), Farm_mean_O = mean(Output)) |>
  ungroup()

ggplot(df_panel_b2, aes(x = Labour, y = Output, colour = Farm, group = Farm)) +
  geom_line(aes(y = Farm_mean_O + beta_fe_b * (Labour - Farm_mean_L)),
            linewidth = 1, linetype = "dashed", alpha = 0.7) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_abline(intercept = coef(mod_ols_b)[1], slope = beta_ols_b,
              colour = "black", linewidth = 1.1) +
  geom_abline(intercept = coef(mod_re_b)[1], slope = beta_re_b,
              colour = "#E41A1C", linewidth = 1.1, linetype = "dotdash") +
  scale_colour_brewer(palette = "Set2") +
  annotate("text", x = 9, y = 94,
           label = paste0("OLS: β̂ = ", beta_ols_b),
           colour = "black", fontface = "bold", size = 3.6) +
  annotate("text", x = 9, y = 90.5,
           label = paste0("RE:  β̂ = ", beta_re_b),
           colour = "#E41A1C", fontface = "bold", size = 3.6) +
  annotate("text", x = 9, y = 87,
           label = paste0("FE:  β̂ = ", beta_fe_b,
                          "  (true β = ", true_beta, ")"),
           colour = "grey30", fontface = "italic", size = 3.6) +
  labs(x = "Labour (workers)", y = "Output (tonnes)",
       title = "Scenario B — All three are consistent; RE is more efficient",
       subtitle = "Solid = OLS  |  Dot-dash = RE  |  Dashed lines = FE (per farm)",
       colour = "Farm") +
  theme_minimal(base_size = 13)

Figure 3 — Scenario B: all estimators are consistent when Cov(αᵢ, xᵢₜ) = 0

When \(\text{Cov}(\alpha_i, x_{it}) \approx 0\):

  • OLS is now consistent (no omitted-variable bias) but ignores the error correlation structure — inefficient.
  • FE is consistent but still discards between-farm information — less precise.
  • RE is consistent and more efficient because it exploits both within and between variation, weighted optimally by the GLS transformation.

When the RE assumption holds, RE is the preferred estimator — unbiased and more precise than FE.
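The efficiency claim can be checked with a small Monte Carlo in base R: simulate panels in which \(\alpha_i\) is independent of labour, then compare the sampling spread of the within (FE) estimator with a quasi-demeaned GLS (RE) estimator. This is a sketch under stated assumptions: the variance parameters are illustrative and \(\theta\) is treated as known rather than estimated, unlike plm's feasible GLS.

```r
set.seed(1)
N <- 5; Tn <- 4; beta <- 2.5; s_a <- 2; s_u <- 1.5
theta <- 1 - sqrt(s_u^2 / (s_u^2 + Tn * s_a^2))   # GLS weight; true sigmas assumed known

one_draw <- function() {
  id   <- rep(1:N, each = Tn)
  mu_x <- rnorm(N, 5, 5)                    # farm-level mean labour, independent of alpha
  x    <- rep(mu_x, each = Tn) + rnorm(N * Tn)
  y    <- rep(rnorm(N, 0, s_a), each = Tn) + beta * x + rnorm(N * Tn, 0, s_u)
  xb <- ave(x, id); yb <- ave(y, id)        # unit means
  c(fe = unname(coef(lm(I(y - yb) ~ 0 + I(x - xb)))),                 # within estimator
    re = unname(coef(lm(I(y - theta * yb) ~ I(x - theta * xb)))[2]))  # quasi-demeaned GLS
}

draws <- replicate(2000, one_draw())
round(rowMeans(draws), 3)       # both centred near the true beta = 2.5
round(apply(draws, 1, sd), 3)   # RE spread smaller: it also uses between variation
```

Both estimators are centred on the truth, but the RE column has the smaller standard deviation because the between-farm variation in labour carries real information once it is no longer confounded.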


Step 5 — The Unique Advantage: Time-Invariant Predictors

One practical reason to prefer RE over FE: FE cannot estimate the effect of any variable that does not change within a unit over time.

Suppose the Ministry also collected data on each farm’s ownership type — whether the farm is family-owned (0) or a cooperative (1). This is fixed across the four years.

Add time-invariant ownership variable
# Farms A, C, E are family-owned; B and D are cooperatives
ownership <- c(0, 1, 0, 1, 0)
names(ownership) <- farms

df_panel_b <- df_panel_b |>
  mutate(Ownership = ownership[Farm])

pdata_b_own <- pdata.frame(df_panel_b, index = c("Farm", "Year"))

# FE: ownership is collinear with farm dummies — dropped
mod_fe_own <- plm(Output ~ Labour + Ownership,
                  data = pdata_b_own, model = "within")

# RE: can estimate ownership effect
mod_re_own <- plm(Output ~ Labour + Ownership,
                  data = pdata_b_own, model = "random")
Display coefficient tables
tibble(
  Parameter     = c("Labour (β₁)", "Ownership — cooperative (β₂)"),
  `FE estimate` = c(round(coef(mod_fe_own)["Labour"], 3),
                    "— (absorbed by farm FE)"),
  `RE estimate` = c(round(coef(mod_re_own)["Labour"],    3),
                    round(coef(mod_re_own)["Ownership"],  3))
) |>
  kable(align = "lcc",
        caption = "Table 4 — FE cannot identify time-invariant predictors; RE can")
Table 4 — FE cannot identify time-invariant predictors; RE can
Parameter FE estimate RE estimate
Labour (β₁) 2.949 2.941
Ownership — cooperative (β₂) — (absorbed by farm FE) 9.213
Note — Why FE cannot estimate time-invariant variables

Fixed effects can be estimated in several ways — including the within-estimator (demeaning), first-differencing, or explicitly including unit dummy variables — but they all produce the same fundamental result: only within-unit variation is used for identification.

The most common approach, within-demeaning, subtracts each unit’s time-mean from every observation:

\[\tilde{y}_{it} = y_{it} - \bar{y}_i, \quad \tilde{x}_{it} = x_{it} - \bar{x}_i\]

A time-invariant variable like ownership has \(x_{it} = x_i\) for all \(t\), so after demeaning: \(\tilde{x}_{it} = x_i - x_i = 0\). The variable is literally zeroed out — it provides no variation to estimate from. The same logic applies regardless of which estimation approach is used: if a variable never changes within a unit, no within-unit estimator can identify its effect.

RE applies only a partial (quasi-) demeaning: each unit mean is scaled by a weight \(\theta < 1\) estimated from the variance components, so some between-unit signal survives the transformation. That between-unit signal identifies \(\beta_2\).

This is a key practical reason RE is preferred in research designs that include important time-invariant covariates (e.g., country-level institutions, individual characteristics such as gender or ethnicity, historical conditions).
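The zeroing-out described in the note above can be verified in a few lines of base R, using the ownership coding from Step 5:

```r
# Ownership is constant within each farm (A, C, E family-owned; B, D cooperatives)
ownership_it <- rep(c(A = 0, B = 1, C = 0, D = 1, E = 0), each = 4)
farm_id      <- rep(LETTERS[1:5], each = 4)
demeaned     <- ownership_it - ave(ownership_it, farm_id)   # subtract each farm's mean
all(demeaned == 0)   # TRUE: no within-farm variation left to estimate from
```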


Step 6 — The Hausman Test: Choosing Between FE and RE

If you are uncertain whether \(\text{Cov}(\alpha_i, x_{it}) = 0\), the Hausman test gives a formal answer.

Logic:

  • Under \(H_0\): RE assumption holds → both FE and RE are consistent, but RE is more efficient.
  • Under \(H_1\): RE assumption fails → FE is consistent; RE is biased.

The test statistic compares the FE and RE coefficient vectors. A large difference (relative to the sampling variance) rejects \(H_0\) and points to FE.

\[H = (\hat\beta_{\text{FE}} - \hat\beta_{\text{RE}})^\top \left[\text{Var}(\hat\beta_{\text{FE}}) - \text{Var}(\hat\beta_{\text{RE}})\right]^{-1} (\hat\beta_{\text{FE}} - \hat\beta_{\text{RE}}) \sim \chi^2(k)\]
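The statistic can also be computed by hand, which makes the formula concrete. A base-R sketch; the commented call assumes the fitted models from Step 3, with the RE intercept dropped so the coefficient vectors line up:

```r
# Hand-rolled Hausman statistic for coefficient vectors b and covariance matrices V
hausman_stat <- function(b_fe, b_re, V_fe, V_re) {
  d <- b_fe - b_re
  H <- as.numeric(t(d) %*% solve(V_fe - V_re) %*% d)
  c(chisq = H, df = length(d), p = pchisq(H, df = length(d), lower.tail = FALSE))
}
# hausman_stat(coef(mod_fe_a), coef(mod_re_a)[-1],
#              vcov(mod_fe_a), vcov(mod_re_a)[-1, -1, drop = FALSE])
```

plm's phtest() performs essentially this comparison, so the hand computation should reproduce the \(\chi^2\) values reported in Table 5.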

Run Hausman test for both scenarios
ht_a <- phtest(mod_fe_a, mod_re_a)   # Scenario A: expect rejection
ht_b <- phtest(mod_fe_b, mod_re_b)   # Scenario B: expect no rejection
Display Hausman test results
tibble(
  Scenario       = c("A — Cov(αᵢ, xᵢₜ) ≠ 0\n(quality correlated with labour)",
                     "B — Cov(αᵢ, xᵢₜ) ≈ 0\n(randomised labour assignment)"),
  `χ² statistic` = c(round(ht_a$statistic, 3), round(ht_b$statistic, 3)),
  df             = c(ht_a$parameter,            ht_b$parameter),
  `p-value`      = c(round(ht_a$p.value, 4),    round(ht_b$p.value, 4)),
  Decision       = c("Reject H₀ → use FE",
                     "Fail to reject H₀ → RE preferred")
) |>
  kable(align = "lcccc",
        caption = "Table 5 — Hausman test results for both scenarios")
Table 5 — Hausman test results for both scenarios
Scenario χ² statistic df p-value Decision
A — Cov(αᵢ, xᵢₜ) ≠ 0 (quality correlated with labour) 8.137 1 0.0043 Reject H₀ → use FE
B — Cov(αᵢ, xᵢₜ) ≈ 0 (randomised labour assignment) 0.010 1 0.9202 Fail to reject H₀ → RE preferred

The Hausman test has real limitations in practice:

  1. Low power in short panels. With few time periods or few units, the test may fail to detect genuine correlation. A non-rejection does not guarantee RE is safe.

  2. It takes FE as the benchmark. The test asks whether RE deviates from FE, under the maintained assumption that FE is consistent. If FE itself is misspecified (e.g., non-linear effects, wrong functional form), the test can give misleading guidance.

  3. Practical knowledge matters. If your theory or prior work strongly suggests that unobserved heterogeneity correlates with your predictor, prefer FE regardless of the test result.

Use the Hausman test as one input, not the final word.


Step 7 — Side-by-Side: All Estimators Compared

Dot plot comparing all estimates
estimates_long <- tibble(
  Estimator = rep(c("Pooled OLS", "Fixed Effects", "Random Effects"), 2),
  Scenario  = rep(c("A: Cov(αᵢ, xᵢₜ) ≠ 0\n(FE needed)",
                    "B: Cov(αᵢ, xᵢₜ) ≈ 0\n(RE preferred)"), each = 3),
  Beta      = c(beta_ols_a, beta_fe_a, beta_re_a,
                beta_ols_b, beta_fe_b, beta_re_b)
) |>
  mutate(
    Estimator = factor(Estimator,
                       levels = c("Pooled OLS", "Random Effects", "Fixed Effects"))
  )

ggplot(estimates_long,
       aes(x = Beta, y = Estimator, colour = Estimator, shape = Estimator)) +
  geom_vline(xintercept = true_beta, linetype = "dashed",
             colour = "grey40", linewidth = 0.8) +
  geom_point(size = 5) +
  annotate("text", x = true_beta + 0.05, y = 0.55,
           label = paste0("True β = ", true_beta),
           colour = "grey40", size = 3.5, hjust = 0) +
  scale_colour_manual(values = c("Pooled OLS"    = "#666666",
                                 "Fixed Effects"  = "#4DAF4A",
                                 "Random Effects" = "#E41A1C")) +
  scale_shape_manual(values  = c("Pooled OLS"    = 15,
                                 "Fixed Effects"  = 17,
                                 "Random Effects" = 16)) +
  facet_wrap(~ Scenario, ncol = 2) +
  labs(x = "Estimated β (Labour → Output)", y = NULL,
       title = "Proximity to the true β across scenarios and estimators") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 4 — Estimates from all three models across both scenarios
Full comparison table
tibble(
  Estimator = c("Pooled OLS", "Fixed Effects (within)", "Random Effects (GLS)"),
  `Sc A β̂`  = c(beta_ols_a, beta_fe_a, beta_re_a),
  `Sc A Bias`= c(round(beta_ols_a - true_beta, 2),
                 round(beta_fe_a  - true_beta, 2),
                 round(beta_re_a  - true_beta, 2)),
  `Sc B β̂`  = c(beta_ols_b, beta_fe_b, beta_re_b),
  `Sc B Bias`= c(round(beta_ols_b - true_beta, 2),
                 round(beta_fe_b  - true_beta, 2),
                 round(beta_re_b  - true_beta, 2))
) |>
  kable(align = "lcccc",
        col.names = c("Estimator",
                      "Sc A: β̂", "Sc A: Bias",
                      "Sc B: β̂", "Sc B: Bias"),
        caption = paste0("Table 6 — All estimators: true β = ", true_beta))
Table 6 — All estimators: true β = 2.5
Estimator Sc A: β̂ Sc A: Bias Sc B: β̂ Sc B: Bias
Pooled OLS 7.925 5.42 2.461 -0.04
Fixed Effects (within) 2.000 -0.50 2.949 0.45
Random Effects (GLS) 4.065 1.57 2.940 0.44
  • Left panel (Scenario A): Only FE (green triangle) lands near the dashed line. OLS and RE are biased upward because they absorb between-farm variation contaminated by soil quality.
  • Right panel (Scenario B): All three are close to the true \(\beta\). RE (red circle) and FE are both consistent; RE’s advantage here is efficiency (smaller standard errors), not bias correction. OLS is also close because randomisation removed the confounding.

Step 8 — Decision Framework

Use the table below as a starting point when choosing between FE and RE in practice.

Question FE RE
Is \(\text{Cov}(\alpha_i, x_{it}) = 0\) plausible? Not required Required
Can I estimate time-invariant predictors? No Yes
Can I estimate effects of variables with little within-unit variation? Poorly Better
Is efficiency a concern (small \(T\), large \(N\))? Less efficient More efficient
Hausman test rejects \(H_0\)? Preferred Inconsistent
Research design uses random assignment? Either works RE preferred
Observational data with likely self-selection? Preferred Only if assumption is defensible
Note — A researcher’s heuristic

“Start with FE. It is the conservative, credible choice. Use RE only when you have a strong theoretical or empirical reason to believe that unobserved heterogeneity is uncorrelated with your regressors — and even then, always report the Hausman test and a FE robustness check side by side.”


Summary

In this exercise you discovered:

  1. Fixed Effects eliminates unobserved heterogeneity by exploiting within-unit variation over time — but at the cost of discarding between-unit information and losing all time-invariant predictors.

  2. Random Effects treats heterogeneity as part of the error and uses GLS to efficiently combine within and between variation — but requires the strong assumption \(\text{Cov}(\alpha_i, x_{it}) = 0\).

  3. The Hausman test is the standard diagnostic for choosing between FE and RE. A significant result favours FE; a non-significant result makes RE defensible, though not guaranteed safe.

  4. When the RE assumption holds, RE dominates: unbiased, efficient, and able to estimate time-invariant predictors that FE cannot touch.

  5. In practice, begin with FE for credibility, and treat RE as a robustness check or primary specification only when the research design (e.g., random assignment) justifies it.

Important — Extension: The Mundlak Device

A powerful synthesis due to Mundlak (1978): include the within-unit means of all time-varying regressors as additional controls in the RE model:

\[y_{it} = \mu + \beta\, x_{it} + \gamma\, \bar{x}_i + \alpha_i^* + u_{it}\]

This “Mundlak device” soaks up the correlation between \(\alpha_i\) and \(x_{it}\), making \(\hat\beta\) numerically equivalent to the FE estimator — while the RE framework still allows time-invariant variables to enter through \(\bar{x}_i\). It is increasingly recommended as a best-of-both-worlds specification, and provides a further diagnostic: if \(\hat\gamma \neq 0\), the RE assumption was violated.
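The numerical equivalence is easy to verify without plm: in a balanced panel, pooled OLS with the farm mean of Labour added as a control returns the FE slope exactly. A base-R sketch using the Scenario A data entered from Table 1:

```r
Farm   <- rep(LETTERS[1:5], times = 4)
Labour <- c(6, 5, 4, 3, 1,  7, 5, 5, 3, 2,  8, 6, 4, 4, 1,  10, 8, 6, 4, 2)
Output <- c(92, 82, 74, 60, 38,  92, 85, 75, 63, 40,
            97, 88, 71, 62, 37,  101, 90, 74, 58, 42)
xbar    <- ave(Labour, Farm)                 # Mundlak term: farm mean of Labour
mundlak <- lm(Output ~ Labour + xbar)
coef(mundlak)["Labour"]   # 2: identical to the FE (within) estimate from Table 2
coef(mundlak)["xbar"]     # large and positive: the RE assumption is violated here
```

The non-zero coefficient on xbar is the Mundlak diagnostic in action: it confirms, without a separate Hausman test, that soil quality is correlated with labour in Scenario A.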