When does regression recover causal effects?

(resolving that gnawing feeling)

Andrew Pua

Previously…

  • We calculated summaries and linear regressions using whatever data we have on hand.

    • We explored how these summaries and regressions may be related to each other algebraically.
    • We also discussed how to communicate the results.
    • We also discovered the behavior of these procedures in the “long run”.
    • We constructed formulas and computational approaches to evaluate the sampling uncertainty.

Now…

  • It depends on the purpose. Knowing the purpose will help narrow down things.

    • If the focus is description of a sample, then a model is not strictly needed.
    • If the focus is prediction, then you need to make sure that you have included all the relevant regressors.
    • If the focus is testing theory, then theory is your guide.
    • If the focus is establishing causality, then you need to have a causal model.

Illustration: Mankiw, Romer, and Weil (1992)

  • This is an example of a paper that tests theory and then makes adjustments to the theory.
  • It is about estimating the textbook Solow growth model and extending it to allow for human capital.
  • In the next slides, you will see a summary of the textbook Solow growth model.

Theory behind MRW

  • Assumptions on the theoretical model:
    • Rates of saving \(s\), population growth \(n\), technological progress \(g\), and depreciation \(\delta\) are exogenous and constant.
    • One output \(Y\left(t\right)\); two inputs, capital \(K\left(t\right)\) and labor \(L\left(t\right)\)
    • Cobb-Douglas technology \(Y\left(t\right)=\left[K\left(t\right)\right]^{\alpha}\left[A\left(t\right)L\left(t\right)\right]^{1-\alpha}\)
    • Competitive labor and capital markets
  • Let \[k\left(t\right)=K\left(t\right)/\left[A\left(t\right)L\left(t\right)\right]\] The evolution of \(k\) is given by \[\dot{k}\left(t\right)=\dfrac{dk\left(t\right)}{dt}=sy\left(t\right)-\left(n+g+\delta\right)k\left(t\right)\]
  • In the steady state, income per capita (Y/L) is given by \[\log y^{*}=\log A\left(0\right)+gt^{*}+\dfrac{\alpha}{1-\alpha}\log s-\dfrac{\alpha}{1-\alpha}\log\left(n+g+\delta\right).\]

Econometrics behind MRW

  • Use data across countries at a specific point in time. Denote each country by index \(i\).

  • Assume that \(g\) and \(\delta\) are constant across countries. Depreciation rates do not vary greatly across countries. (Justifications are on page 410.)

  • Assume that \(\log A\left(0\right)_{i}=a+\varepsilon_{i}\). Thus, \[\log y_{i}^{*} = a+gt^{*}+\dfrac{\alpha}{1-\alpha}\log s_{i}-\dfrac{\alpha}{1-\alpha}\log\left(n_{i}+g+\delta\right)+\varepsilon_{i}.\]

  • MRW assume that \(s_{i}\) and \(n_{i}\) are independent of \(\varepsilon_{i}\). (Economic justifications are given on page 411.)

  • Is there an econometric justification for imposing independence?

    • Which version of the linear regression model is being used by MRW?
    • This is a point you have to be aware of when you take transformations of the regressand.
    • A similar point is made in research on gravity models in international trade. See Santos Silva and Tenreyro (2006).

Data used by MRW

  • Variables:

    • average rate of growth of working-age population \(n_{i}\)
    • average share of real investment in real GDP \(s_{i}\)
    • real GDP in 1985 divided by the working-age population in 1985 \(y_{i}^{*}\)
  • Three samples of countries: 98 non-oil countries, 75 countries with better quality data (intermediate sample), 22 OECD countries with populations greater than 1 million

  • Assume that \(g+\delta=0.05\).

Replicating Table I of MRW (1992)

  • The model estimated was \[\log y_{i}^{*} = \beta_{0}+\beta_{1}\log s_{i}+\beta_{2}\log\left(n_{i}+g+\delta\right)+\varepsilon_{i}.\]
options(digits=3) # Learn to present to the appropriate precision
library(foreign)
MRW <- read.dta("./MRW.dta")
# Generate new variables
MRW$ly85 <- log(MRW$y85)
MRW$linv <- log(MRW$inv/100)
MRW$lpop <- log(MRW$pop/100 + 0.05)
MRW.TableI.nonoil <- lm(ly85 ~ linv + lpop, data = subset(MRW, MRW$n==1)) # Apply OLS
summary(MRW.TableI.nonoil)

Call:
lm(formula = ly85 ~ linv + lpop, data = subset(MRW, MRW$n == 
    1))

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7914 -0.3937  0.0412  0.4337  1.5805 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.430      1.584    3.43  0.00090 ***
linv           1.424      0.143    9.95  < 2e-16 ***
lpop          -1.990      0.563   -3.53  0.00064 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.689 on 95 degrees of freedom
Multiple R-squared:  0.601, Adjusted R-squared:  0.592 
F-statistic: 71.5 on 2 and 95 DF,  p-value: <2e-16
  • MRW essentially assume correct specification after log-linearization.
  • The covariance matrix (along with the standard errors) was calculated under conditional homoscedasticity because MRW imposed it through the independence assumption.
  • Are these plausible?

Testing the Solow growth model

  • The Solow hypothesis is \(\beta_1=-\beta_2\). Where did this come from?
  • What are \(R\) and \(q\) for the situation here? How many restrictions?
  • I show many versions of doing the same test, ranging from the most convenient to the more bespoke.
  • The easiest way is to use linearHypothesis() from the car package.
library(car)
linearHypothesis(MRW.TableI.nonoil, c("linv  + lpop"), rhs = 0, test = "Chisq")
Linear hypothesis test

Hypothesis:
linv  + lpop = 0

Model 1: restricted model
Model 2: ly85 ~ linv + lpop

  Res.Df  RSS Df Sum of Sq Chisq Pr(>Chisq)
1     96 45.5                              
2     95 45.1  1     0.396  0.83       0.36
  • It is possible to test the Solow hypothesis after a reparameterization.
  • Let \(\beta_1+\beta_2=\theta\). So, the Solow hypothesis becomes a test of \(\theta=0\).
  • The reparameterized model is now \[\log y_{i}^{*} = \beta_{0}+\beta_1\left[\log s_{i}-\log\left(n_{i}+g+\delta\right)\right]+\theta\log\left(n_{i}+g+\delta\right)+\varepsilon_{i}.\]
# Generate new variable
MRW$ldiff <- MRW$linv - MRW$lpop
# Apply OLS to reparameterized model
MRW.TableI.repar.nonoil <- lm(ly85 ~ ldiff + lpop, data = subset(MRW, MRW$n==1))
summary(MRW.TableI.repar.nonoil)[[4]] # Focusing on the summary of the coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.430      1.584   3.428 9.00e-04
ldiff          1.424      0.143   9.951 2.10e-16
lpop          -0.566      0.619  -0.913 3.63e-01
  • Or, if you wish…
linearHypothesis(MRW.TableI.repar.nonoil, c("lpop"), rhs = 0, test = "Chisq")
Linear hypothesis test

Hypothesis:
lpop = 0

Model 1: restricted model
Model 2: ly85 ~ ldiff + lpop

  Res.Df  RSS Df Sum of Sq Chisq Pr(>Chisq)
1     96 45.5                              
2     95 45.1  1     0.396  0.83       0.36
  • Another approach is to use an SSR comparison as seen here.
  • You already know what the unrestricted model looks like.
  • The restricted model is the model where the Solow hypothesis is imposed, i.e. \[\log y_{i}^{*} = \beta_{0}+\beta_{1}\left[\log s_{i}-\log\left(n_{i}+g+\delta\right)\right]+\varepsilon_{i}.\]
  • Sometimes it may be difficult to write down what the restricted model would look like!
# Apply OLS to restricted regression
MRW.TableI.restricted.nonoil <- lm(ly85 ~ ldiff, data = subset(MRW, MRW$n==1))
# Compute test using sum of squared residual comparisons
# This is ok under conditional homoscedasticity
# available in base R
anova(MRW.TableI.restricted.nonoil, MRW.TableI.nonoil)
Analysis of Variance Table

Model 1: ly85 ~ ldiff
Model 2: ly85 ~ linv + lpop
  Res.Df  RSS Df Sum of Sq    F Pr(>F)
1     96 45.5                         
2     95 45.1  1     0.396 0.83   0.36
  • Use the theory here directly.
coef.TableI.nonoil <- coef(MRW.TableI.nonoil)
cov.TableI.nonoil <- vcov(MRW.TableI.nonoil)
R.mat <- c(0, 1, 1) # R matrix for testing Solow hypothesis
# Calculate F statistic for testing Solow hypothesis
test.stat <- t(R.mat %*% coef.TableI.nonoil) %*% solve(R.mat %*% cov.TableI.nonoil %*% R.mat) %*%
  R.mat %*% coef.TableI.nonoil
# Test statistic, p-value, critical value
c(test.stat, 1-pchisq(test.stat, 1), qchisq(0.95, 1))
[1] 0.834 0.361 3.841
# Finite-sample adjustments
c(test.stat, 1-pf(test.stat, 1, 95), qf(0.95, 1, 95))
[1] 0.834 0.363 3.941

MRW: Nonlinear functions of regression parameters

  • There seems to be support from the data that the Solow hypothesis holds.
  • \(\alpha\) is an important parameter in the Solow growth model. It is possible to compute its implied value.
  • In fact, MRW claim that \(\alpha=1/3\) based on a wide set of past studies on labor intensity in aggregate production.
  • To compute the implied \(\alpha\), the restricted regression would have to be the basis. How are \(\alpha\) and \(\beta_1\) related?
  • How do you obtain an estimate and a standard error for the estimate of the implied \(\alpha\)?
  • To get standard errors, we apply the delta method discussed here.
est.beta1 <- coef(MRW.TableI.restricted.nonoil)[[2]]
implied.alpha <- est.beta1/(1+est.beta1)
implied.alpha
[1] 0.598
est.var <- c(0, 1/(1+est.beta1)^2) %*% vcov(MRW.TableI.restricted.nonoil) %*% c(0, 1/(1+est.beta1)^2)
delta.method.se <- sqrt(est.var)
delta.method.se
       [,1]
[1,] 0.0201
# Large-sample 95% confidence interval for alpha
c(implied.alpha-1.96*delta.method.se, implied.alpha+1.96*delta.method.se)
[1] 0.559 0.638
  • We can also use the bootstrap to obtain the standard errors.
# Note this MRW file contains ly85, ldiff
nonoil <- subset(MRW, MRW$n==1)
boot.reps <- 99
boot.alpha <- numeric(boot.reps)
for(i in 1:boot.reps)
{
  MRW.TableI.restricted.nonoil <- lm(ly85 ~ ldiff, data = nonoil[sample(1:nrow(nonoil), replace=TRUE),]) 
  est.beta1 <- coef(MRW.TableI.restricted.nonoil)[[2]]
  boot.alpha[i] <- est.beta1/(1+est.beta1)
}
boot.se <- sd(boot.alpha)
boot.se
[1] 0.0171
# Large-sample 95% confidence interval for alpha
# Based on bootstrap SE
c(implied.alpha-1.96*boot.se, implied.alpha+1.96*boot.se)
[1] 0.565 0.632
  • We can also construct bootstrap confidence intervals.

  • We set things up and apply the double bootstrap.

innerboot.reps <- 99
outerboot.reps <- 999
inner.alpha <- numeric(innerboot.reps)
outer.alpha <- numeric(outerboot.reps)
boot.pivot <- numeric(outerboot.reps)
for(i in 1:outerboot.reps)
{
  outer.lm <- lm(ly85 ~ ldiff, data = nonoil[sample(1:nrow(nonoil), replace=TRUE),]) 
  outer.beta1 <- coef(outer.lm)[[2]]
  outer.alpha[i] <- outer.beta1/(1+outer.beta1)
  for(j in 1:innerboot.reps)
  {
    inner.lm <- lm(ly85 ~ ldiff, data = nonoil[sample(1:nrow(nonoil), replace=TRUE),]) 
    inner.beta1 <- coef(inner.lm)[[2]]
    inner.alpha[j] <- inner.beta1/(1+inner.beta1)
  }
  inner.boot.se <- sd(inner.alpha)
  boot.pivot[i] <- (outer.alpha[i]-implied.alpha)/inner.boot.se
}
# Compare with plus/minus 1.96
quantile(boot.pivot, c(0.975, 0.025))
97.5%  2.5% 
 1.76 -2.13 
# Bootstrap 95% confidence interval
implied.alpha-quantile(boot.pivot, c(0.975, 0.025))*boot.se
97.5%  2.5% 
0.568 0.634 
  • Unlike the analytical large-sample 95% confidence intervals, the intervals based on the bootstrap (whether using the bootstrapped SE or the bootstrap CI) change from run to run because they depend on random resampling.
  • If you want to present results in a paper or report, you have to use set.seed() for reproducibility.
  • Here is a version using boott() from the bootstrap package.
nonoil <- subset(MRW, MRW$n==1)
implied.alpha <- function(rownumbers, data)
{
  apply.OLS <- lm(ly85 ~ ldiff, data = data[rownumbers,]) 
  get.beta1 <- coef(apply.OLS)[[2]]
  return(get.beta1/(1+get.beta1))
}
library(bootstrap)
boott(1:nrow(nonoil), implied.alpha, nonoil, nbootsd = 99, nboott = 999, perc = c(.025,.975))$confpoints
     0.025 0.975
[1,] 0.558 0.628
  • Bottom line: MRW want to test whether \(\alpha=1/3\). Based on the 95% confidence intervals, we reject \(\alpha=1/3\).
  • Furthermore, MRW believe that the many empirical studies obtaining alternative estimates of \(\alpha\) without imposing the Solow growth model could not all be wrong!
  • So, it is likely that the textbook Solow growth model is incompatible with the data.
  • You can continue exploring the paper for more on:

    • Modifying the theory to account for the empirical evidence against \(\alpha=1/3\) which leads to the augmented Solow growth model
    • Modifying the specification and re-estimating the econometric model (look at the new set of assumptions)
    • The empirical results: which you can now reproduce
  • MRW also attempt to estimate the speed of convergence, i.e., how long it will take for poor countries to catch up with rich countries. You can already explore this too!

How do you establish causality?

  • Causality relies on counterfactuals.
  • We have spent most of our time talking about making comparisons and these comparisons are either predicted differences or best linear approximations of the predicted differences.
  • When we made comparisons before, we were making comparisons across different subgroups – apples to oranges!
  • To establish causality, the comparisons have to be apples to apples and oranges to oranges!

The need for new language

  • In order to make apples to apples or oranges to oranges comparisons, we have to establish new language.
  • This means we need new notation to reflect the fact that we are comparing the outcomes of the same unit.
  • You will learn more about this new language in the next semester. I give a short preview here.

Building intuition

  • Let us tackle the simplest case. Suppose there is a policy \(D\) and you want to evaluate the effects of this policy.
  • How would you evaluate the effect of the policy in the first place?
  • You need to tie the policy to outcomes which could be measured. We can focus on a single outcome \(Y\).
  • We also need to define what the effect actually looks like.
  • Let \(t\) index units. We have a triple \((Y,D,U)\) but we can only observe \((Y,D)\) for purposes of evaluating policy.
  • We are going to set up a causal model or a structural equation or a response schedule of the form: \[Y_t\left(D_t,U_t\right)=\alpha+\lambda D_t+\gamma U_t\]
  • Define the causal effect of \(D\) on \(Y\) as \[Y_t\left(d+1,U_t\right)-Y_t\left(d,U_t\right)\]
  • The causal effect for this particular case of a structural equation is given by \(\lambda\).
  • The million-dollar question is really what happens when I run a regression of \(Y\) on \(D\). Do I recover \(\lambda\)?
  • You already know that under IID conditions, \[\widehat{\beta}_1 \overset{p}{\to} \frac{\mathsf{Cov}\left(Y_t,D_t\right)}{\mathsf{Var}\left(D_t\right)}\]
  • The issue is whether the right-hand side actually equals \(\lambda\)!
  • If it does, then you can use linear regression to recover the causal effect. If it does not, then you use something else, think a bit harder, or make plausible assumptions which will allow you to still use linear regression to recover \(\lambda\).
  • You can show that \[\frac{\mathsf{Cov}\left(Y_t,D_t\right)}{\mathsf{Var}\left(D_t\right)}=\lambda+\gamma\frac{\mathsf{Cov}\left(U_t,D_t\right)}{\mathsf{Var}\left(D_t\right)}\]
  • Thus, if \(\mathsf{Cov}\left(U_t,D_t\right)=0\), then you can recover \(\lambda\). But \(U\) is in the structural equation rather than in the assumption-lean linear regression model of the form \[Y_t=\beta_0+\beta_1 D_t+\varepsilon_t\]
  • So, even if \(\mathsf{Cov}\left(\varepsilon_t,D_t\right)=0\), we still could not recover \(\lambda\).
  • Even if we use Version 2 of the correctly specified linear regression model so that \(\mathbb{E}\left(\varepsilon_t|D_t\right)=0\), we still could not recover \(\lambda\), because what we need is \(\mathsf{Cov}\left(U_t,D_t\right)=0\).
  • The statistical error \(\varepsilon\) is different from the unobserved error \(U\) in the causal model!
  • This point was recently reiterated by Crudu et al (2022). I think you will have enough background to read this paper now.
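  • To make this concrete, here is a minimal simulation sketch with made-up values for \(\lambda\), \(\gamma\), and the dependence between \(D\) and \(U\): the OLS slope settles near \(\lambda+\gamma\,\mathsf{Cov}\left(U_t,D_t\right)/\mathsf{Var}\left(D_t\right)\), not near \(\lambda\).
# Simulation sketch: confounded policy variable (hypothetical parameter values)
set.seed(12345)
n <- 10^5
lambda <- 2
gamma <- 3
U <- rnorm(n)
D <- 0.5*U + rnorm(n)              # D is correlated with the unobserved U
Y <- 1 + lambda*D + gamma*U        # the causal model / response schedule
coef(lm(Y ~ D))[[2]]               # OLS slope of Y on D
lambda + gamma*cov(U, D)/var(D)    # the probability limit derived above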

Another formulation

  • Given the previous discussion, we will assume that \(\gamma\) is equal to zero so that we can focus on what linear regression recovers in a slightly different context.
  • Assume this time that \(D\in\{0,1\}\).
  • We now consider a structural equation of the form \[Y_t\left(D_t\right)=\alpha_t+\lambda_t D_t\]
  • Here we have \(Y_t\left(0\right)=\alpha_t\). This represents heterogeneity across the baseline.
  • We also have \(Y_t\left(1\right)-Y_t\left(0\right)=\lambda_t\). This represents heterogeneity in causal effects.
  • What would linear regression of \(Y\) on \(D\) now recover in large samples?
  • Just like before, you already know that under IID conditions, \[\widehat{\beta}_1 \overset{p}{\to} \frac{\mathsf{Cov}\left(Y_t,D_t\right)}{\mathsf{Var}\left(D_t\right)}\]
  • Here \[Y_t=\underbrace{\mathbb{E}\left(\alpha_t\right)}_{\alpha}+\underbrace{\mathbb{E}\left(\lambda_t\right)}_{\lambda}D_t+\underbrace{\left[\alpha_t-\mathbb{E}\left(\alpha_t\right)+\left(\lambda_t-\mathbb{E}\left(\lambda_t\right)\right)D_t\right]}_{U_t}\]
  • So, this is very similar to the structural equation you saw before!
  • Thus, if \(\mathsf{Cov}\left(U_t,D_t\right)=0\), then you can recover \(\mathbb{E}\left(\lambda_t\right)\). But this \(\lambda\) is the population average of unit-specific causal effects!
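  • As a quick check, here is a minimal simulation sketch with made-up distributions for \(\alpha_t\) and \(\lambda_t\), and with \(D\) assigned independently of them: the OLS slope settles near \(\mathbb{E}\left(\lambda_t\right)\).
# Simulation sketch: heterogeneous effects, randomly assigned D (hypothetical DGP)
set.seed(12345)
n <- 10^5
alpha.t <- rnorm(n, mean = 1)               # heterogeneous baselines Y_t(0)
lambda.t <- rnorm(n, mean = 2, sd = 0.5)    # heterogeneous effects, E(lambda_t) = 2
D <- rbinom(n, 1, 0.4)                      # independent of (alpha_t, lambda_t)
Y <- alpha.t + lambda.t*D
coef(lm(Y ~ D))[[2]]                        # close to E(lambda_t) = 2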

Randomized controlled trials

  • The new language and the framework we just discussed provide a guide to what to look for in real applications.

  • The idea is to ensure \(\mathsf{Cov}\left(U_t,D_t\right)=0\).

  • Randomized controlled trials (RCTs) provide a setting to achieve this.

    • Here, investigators assign subjects at random to a treatment group \(D=1\) or to a control group \(D=0\).
    • Data collection can be expensive. Ethical issues do arise.

“As-if” randomization

  • If designed properly, RCTs will ensure the independence of \(U\) and \(D\), which will then imply \(\mathsf{Cov}\left(U_t,D_t\right)=0\).
  • What if you cannot achieve this by design or cannot guarantee \(\mathsf{Cov}\left(U_t,D_t\right)=0\)?
  • One approach is to find \(X\)’s as control variables in order to achieve a form of conditional independence: \(U\) and \(D\) are independent conditional on \(X\).
  • Some books call this the conditional independence assumption (CIA) or selection on observables.
  • To get the crucial idea, assume that \(X\in\{0,1\}\) is discrete and that \(X\) is the appropriate control variable.
  • Again, use linear regression of \(Y\) on \(D\) and \(X\). What does the regression slope on \(D\), which we denote by \(\widehat{\beta}_1\), recover?
  • Just like before, you already know that with FWL and under IID conditions, \[\widehat{\beta}_1 \overset{p}{\to} \frac{\mathsf{Cov}\left(Y_t, \widetilde{D}_t\right)}{\mathsf{Var}\left(\widetilde{D}_t\right)}\]
  • Here \(\widetilde{D}_t=D_t-\gamma_0-\gamma_1 X_t\) which is the error from the best linear prediction of \(D_t\) given \(X_t\).
  • This should sound familiar as this is the population version of FWL, which is called the regression anatomy in Mastering Metrics and Mostly Harmless Econometrics.
  • More importantly, because \(D\) and \(X\) are both binary (dummies), we are in the case of a saturated model. So the CEF is equal to the BLP, i.e. \[\widetilde{D}_t=D_t-\mathbb{E}\left(D_t|X_t\right)\]
  • This specific setting is quite special. For example, if \(X\) were a continuous r.v., then you bring in the requirement of correct specification if you claim that \[\widetilde{D}_t=D_t-\mathbb{E}\left(D_t|X_t\right).\]
  • You now have to figure out what you learn from computing \(\widehat{\beta}_1\) when you have more and more observations. What causal effect are you recovering? Is that of interest?
  • You can show, by the law of total variance and the properties of the CEF error, that \[\begin{eqnarray*}&& \mathsf{Var}\left(\widetilde{D}_t\right) \\ &=& \mathsf{Var}\left(\widetilde{D}_t|X_t=1\right)\mathbb{P}\left(X_t=1\right)+\mathsf{Var}\left(\widetilde{D}_t|X_t=0\right)\mathbb{P}\left(X_t=0\right) \\ &=& \mathsf{Var}\left(D_t|X_t=1\right)\mathbb{P}\left(X_t=1\right)+\mathsf{Var}\left(D_t|X_t=0\right)\mathbb{P}\left(X_t=0\right) \\ &=& \underbrace{\mathbb{P}\left(D_t=1|X_t=1\right)}_{p_D\left(1\right)}\left(1-\mathbb{P}\left(D_t=1|X_t=1\right)\right)\mathbb{P}\left(X_t=1\right) \\ && + \underbrace{\mathbb{P}\left(D_t=1|X_t=0\right)}_{p_D\left(0\right)}\left(1-\mathbb{P}\left(D_t=1|X_t=0\right)\right)\mathbb{P}\left(X_t=0\right)\end{eqnarray*} \]
  • You can also show, that \[\begin{eqnarray*} &&\mathsf{Cov}\left(Y_t,\widetilde{D}_t | X_t\right) \\ &=& \mathbb{E}\left(Y_t\widetilde{D}_t|X_t\right)-\mathbb{E}\left(Y_t|X_t\right)\mathbb{E}\left(\widetilde{D}_t|X_t\right)\\ &=& \mathbb{E}\left(\left(\alpha_t+\lambda_tD_t\right)\widetilde{D}_t|X_t\right)\\ &=&\mathsf{Cov}\left(\alpha_t, D_t|X_t\right)+\mathsf{Cov}\left(\lambda_tD_t,D_t | X_t\right) \\ &=&\mathbb{E}\left(\lambda_t D_t^2|X_t\right)-\mathbb{E}\left(\lambda_t D_t|X_t\right)\mathbb{E}\left( D_t|X_t\right) \\ &=& \mathbb{E}\left(\lambda_t D_t|X_t\right)-\mathbb{E}\left(\lambda_t D_t|X_t\right)\mathbb{E}\left(D_t|X_t\right) \\ &=& \mathbb{E}\left(\lambda_t|X_t\right)\mathbb{E}\left( D_t|X_t\right)- \mathbb{E}\left(\lambda_t|X_t\right)\left(\mathbb{E}\left( D_t|X_t\right)\right)^2\\ &=& \mathbb{E}\left(\lambda_t|X_t\right)\mathbb{E}\left( D_t|X_t\right)\left(1-\mathbb{E}\left( D_t|X_t\right)\right)\end{eqnarray*}\]
  • Therefore, \[\begin{eqnarray*}&&\mathsf{Cov}\left(Y_t,\widetilde{D}_t\right)\\&=& \mathbb{E}\left(Y_t\widetilde{D}_t\right)-\mathbb{E}\left(Y_t\right)\mathbb{E}\left(\widetilde{D}_t\right) \\ &=& \mathbb{E}\left[\mathbb{E}\left(Y_t\widetilde{D}_t|X_t\right)\right]\\ &=& \mathbb{E}\left(Y_t\widetilde{D}_t|X_t=1\right)\mathbb{P}\left(X_t=1\right)+\mathbb{E}\left(Y_t\widetilde{D}_t|X_t=0\right)\mathbb{P}\left(X_t=0\right)\\ &=& \mathbb{E}\left(\lambda_t|X_t=1\right)\mathbb{E}\left( D_t|X_t=1\right)\left(1-\mathbb{E}\left( D_t|X_t=1\right)\right)\mathbb{P}\left(X_t=1\right) \\ &&+\mathbb{E}\left(\lambda_t|X_t=0\right)\mathbb{E}\left( D_t|X_t=0\right)\left(1-\mathbb{E}\left( D_t|X_t=0\right)\right)\mathbb{P}\left(X_t=0\right)\\ &=& \mathbb{E}\left(\lambda_t|X_t=1\right)p_D\left(1\right)\left(1-p_D\left(1\right)\right)\mathbb{P}\left(X_t=1\right) \\ && +\mathbb{E}\left(\lambda_t|X_t=0\right)p_D\left(0\right)\left(1-p_D\left(0\right)\right)\mathbb{P}\left(X_t=0\right)\end{eqnarray*}\]
  • Putting all of these together, we have \[\begin{eqnarray*}\widehat{\beta}_1 &\overset{p}{\to}& \frac{\mathsf{Cov}\left(Y_t,\widetilde{D}_t\right)}{\mathsf{Var}\left(\widetilde{D}_t\right) } \\ &=& \frac{\displaystyle\sum_{x\in\{0,1\}} \mathbb{E}\left(\lambda_t|X_t=x\right)p_D\left(x\right)\left(1-p_D\left(x\right)\right)\mathbb{P}\left(X_t=x\right)}{\displaystyle\sum_{x\in\{0,1\}}p_D\left(x\right)\left(1-p_D\left(x\right)\right)\mathbb{P}\left(X_t=x\right)}\end{eqnarray*}\]
  • Therefore, OLS recovers a conditional average causal effect subject to variance weighting with respect to \(D\). The question is whether this weighted average answers your research question. (A simulation sketch appears at the end of this slide.)

  • If \(\mathbb{E}\left(\lambda_t|X_t=x\right)\) is the same across all \(x\), then OLS does recover an average causal effect \(\mathbb{E}\left(\lambda_t\right)\).

  • There are two issues:

    • What \(X\)’s should you include?
    • Even if you found the right set of \(X\)’s, are you sure that treatment and control groups are comparable?
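  • Here is a minimal simulation sketch of the variance weighting. All numbers below (\(\mathbb{P}\left(X_t=1\right)=0.3\), \(p_D\left(1\right)=0.9\), \(p_D\left(0\right)=0.3\), \(\mathbb{E}\left(\lambda_t|X_t=1\right)=4\), \(\mathbb{E}\left(\lambda_t|X_t=0\right)=1\)) are made up for illustration.
# Simulation sketch: CIA with binary X and D, heterogeneous effects (hypothetical DGP)
set.seed(12345)
n <- 10^6
X <- rbinom(n, 1, 0.3)
pD <- ifelse(X == 1, 0.9, 0.3)
D <- rbinom(n, 1, pD)                      # D depends only on the observable X: CIA holds
lambda.t <- ifelse(X == 1, 4, 1) + rnorm(n)
Y <- 2 + 0.5*X + lambda.t*D + rnorm(n)     # baseline differs by X, effect differs by X
coef(lm(Y ~ D + X))[["D"]]                 # OLS slope on D in the long regression
# Variance-weighted conditional average effect from the derivation above
w1 <- 0.9*0.1*0.3; w0 <- 0.3*0.7*0.7
(4*w1 + 1*w0)/(w1 + w0)                    # about 1.47, which OLS recovers
mean(lambda.t)                             # the unweighted average effect is about 1.9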

What \(X\) should we include?

  • Given what you know so far, there is absolutely no guidance for this.

  • In fact, the answer is “it depends” on the goal.

  • What I show you here is what to think about when the goal is to recover causal effects.

  • Assume \(U_X\), \(U_C\), \(U_Y\) are independent of each other. There are three basic cases to consider:

  • Case 1 (the chain): \(U_X\sim N\left(0,1\right)\), \(U_Y\sim N\left(0,1\right)\), \(U_C\sim N\left(0,1\right)\), \(X=U_X\), \(C=5X+U_C\), \(Y=3C+U_Y\)
  • Case 2 (the fork): \(U_X\sim N\left(0,1\right)\), \(U_Y\sim N\left(0,1\right)\), \(U_C\sim N\left(100,15\right)\), \(C=U_C\), \(X=200-C+U_X\), \(Y=0.5C+0.1X+U_Y\)
  • Case 3 (the collider): \(U_X\sim N\left(0,1\right)\), \(U_Y\sim N\left(0,1\right)\), \(U_C\sim Ber(0.05)\), \(X=U_X\), \(Y=U_Y\), \(\widetilde{C}=\begin{cases}1 & X>1\:\mathrm{or}\:Y>1\\ 0 & \mathrm{otherwise} \end{cases}\), \(C=\left(1-U_C\right)\widetilde{C}+U_C\left(1-\widetilde{C}\right)\)
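  • Before looking at the figures, here is a minimal simulation sketch of the three cases using the DGPs above (treating the second parameter of each normal distribution as a standard deviation). It compares the slope on \(X\) with and without \(C\) as a control.
# Simulation sketch of the three cases
set.seed(12345)
n <- 10^5
# Case 1 (the chain): controlling for C blocks the effect of X on Y
X <- rnorm(n); C <- 5*X + rnorm(n); Y <- 3*C + rnorm(n)
c(coef(lm(Y ~ X))[[2]], coef(lm(Y ~ X + C))[[2]])   # about 15 versus about 0
# Case 2 (the fork): controlling for C removes the confounding
C <- rnorm(n, 100, 15); X <- 200 - C + rnorm(n); Y <- 0.5*C + 0.1*X + rnorm(n)
c(coef(lm(Y ~ X))[[2]], coef(lm(Y ~ X + C))[[2]])   # badly biased versus about 0.1
# Case 3 (the collider): controlling for C creates a spurious association
X <- rnorm(n); Y <- rnorm(n)
Ctilde <- as.numeric(X > 1 | Y > 1)
UC <- rbinom(n, 1, 0.05)
C <- (1 - UC)*Ctilde + UC*(1 - Ctilde)
c(coef(lm(Y ~ X))[[2]], coef(lm(Y ~ X + C))[[2]])   # about 0 versus clearly nonzero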

Case 1

Case 2

Case 3

What if “as-if” randomization cannot be achieved?

  • Why did we do all that algebra? Because the derivation lays out what to look out for if one uses a regression strategy to recover causal effects.

  • Things can fail and it can be traced to two reasons:

    • Lack of balance
    • Lack of complete overlap
  • In the next semester, I think you will be looking more into these issues. I will jump to a different strategy called instrumental variables (IV).

Instrumental variable strategies

  • There are roughly three ways to look at instrumental variables (IV):

    • It is a strategy to deal with endogenous regressors. This leads us to distinguish between linear regressions and linear structural equations.
    • It is a component of a general-purpose approach to estimating and conducting inference on parameters of interest.
    • It is a strategy to identify causal effects when CIA does not hold.

Example of linear structural equations: setup

  • Consider \[\begin{eqnarray}C_t &=& \beta_0+\beta_1 I_t+\varepsilon_t \\ I_t &=&C_t+D_t,\end{eqnarray}\] where \(C_t\) is consumption, \(I_t\) is income, and \(D_t\) is non-consumption.
  • Assume, for convenience, that \[\left(\begin{array}{c} D\\ \varepsilon \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{D}\\ 0 \end{array}\right),\left(\begin{array}{cc} \sigma_{D}^{2} & 0\\ 0 & \sigma_{\varepsilon}^{2} \end{array}\right)\right).\] Assume we have IID draws from this distribution.

  • Solving the two equations gives \(I_t=\left(\beta_0+D_t+\varepsilon_t\right)/\left(1-\beta_1\right)\), so you can easily check that \(\mathbb{E}\left(I_t\varepsilon_t\right)= \sigma^2_{\varepsilon}/\left(1-\beta_1\right)\).

  • Therefore, \(\mathbb{E}\left(I_t\varepsilon_t\right)\neq 0\) whenever \(\sigma^2_{\varepsilon}>0\).

  • So, if you apply least squares to a linear regression of \(C_t\) on \(I_t\), you will not recover \(\beta_1\) no matter how large the sample size is.

Example of linear structural equations: source of the problem

  • You can show that the BLP of \(C\) given \(I\) is \(\gamma_0+\gamma_1 I_t\) where \[\begin{eqnarray} \gamma_1 &=& \frac{\beta_1\sigma^2_D+\sigma^2_{\varepsilon}}{\sigma^2_D+\sigma^2_{\varepsilon}}=\theta\beta_1+\left(1-\theta\right) \\ \gamma_0 &=& \frac{\beta_0+\beta_1\mu_D-\gamma_1\left(\beta_0+\mu_D\right)}{1-\beta_1}=\theta\beta_0-\left(1-\theta\right)\mu_D, \end{eqnarray}\]where \(\theta = \sigma^2_D/\left(\sigma^2_D+\sigma^2_{\varepsilon}\right)\).
  • Thus, we have two representations of the relationship between \(C_t\) and \(I_t\):
    • One which is always true: \(C_t=\gamma_0+\gamma_1 I_t+\eta_t\), where \(\eta_t\) is a BLP error.
    • The other is assumed to be true: \(C_t=\beta_0+\beta_1I_t+\varepsilon_t\), where \(\varepsilon_t\) is a structural error and not a CEF error.
  • What can you estimate?
    • OLS estimates \(\gamma_0\) and \(\gamma_1\) consistently.
    • But OLS cannot estimate \(\beta_0\) and \(\beta_1\) consistently. We need an alternative method.

Example of linear structural equations: recovering \(\beta_0\) and \(\beta_1\)

  • Some terminology:
    • \(I_t\) is called an endogenous variable or endogenous regressor.
    • \(\varepsilon_t\) is called a structural error.
    • \(C_t=\beta_0+\beta_1I_t+\varepsilon_t\) is called a structural equation or structural model or response schedule.
    • \(\beta_0\) and \(\beta_1\) are called structural parameters.
  • To recover \(\beta_1\), the idea is to have an identification argument:
    • Observe that \[\begin{eqnarray} \mathsf{Cov}\left(C_t, D_t\right) &=& \beta_1\mathsf{Cov}\left(I_t, D_t\right)+\mathsf{Cov}\left(D_t,\varepsilon_t\right) \\ \mathsf{Cov}\left(C_t, D_t\right) &=& \beta_1\mathsf{Cov}\left(I_t, D_t\right) \\ \beta_1 &=&\frac{\mathsf{Cov}\left(C_t, D_t\right)}{\mathsf{Cov}\left(I_t, D_t\right)} \end{eqnarray}\]
    • The argument works when \(\mathsf{Cov}\left(D_t,\varepsilon_t\right)=0\) and \(\mathsf{Cov}\left(I_t,D_t\right)\neq 0\).
    • Thus, we can uniquely recover \(\beta_1\).
  • To recover \(\beta_0\), the idea is to have another identification argument:
    • Observe that \[\begin{eqnarray} \mathbb{E}\left(C_t\right) &=& \beta_0+\beta_1 \mathbb{E}\left(I_t\right)+ \mathbb{E}\left(\varepsilon_t\right) \\ \mathbb{E}\left(C_t\right) &=& \beta_0+\beta_1 \mathbb{E}\left(I_t\right)\\ \beta_0 &=& \mathbb{E}\left(C_t\right)-\beta_1 \mathbb{E}\left(I_t\right)\end{eqnarray}\]
  • Provided that \(\mathbb{E}\left(\varepsilon_t\right)=0\) and \(\beta_1\) is uniquely identified, then we can also uniquely identify \(\beta_0\).
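  • A minimal simulation sketch (with made-up values for \(\beta_0\), \(\beta_1\), \(\mu_D\), \(\sigma_D\), \(\sigma_{\varepsilon}\)) illustrates both the failure of OLS and the identification argument based on \(D\).
# Simulation sketch of the consumption example (hypothetical parameter values)
set.seed(12345)
n <- 10^5
beta0 <- 1; beta1 <- 0.6
D <- rnorm(n, mean = 5, sd = 2)            # non-consumption, independent of epsilon
eps <- rnorm(n)                            # structural error
Inc <- (beta0 + D + eps)/(1 - beta1)       # income I_t, solved from the two equations
Cons <- beta0 + beta1*Inc + eps            # consumption C_t
coef(lm(Cons ~ Inc))[[2]]                  # OLS: about 0.68, not beta1 = 0.6
cov(Cons, D)/cov(Inc, D)                   # identification argument: about 0.6
mean(Cons) - cov(Cons, D)/cov(Inc, D)*mean(Inc)  # recovers beta0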

The method of moments

  • From our example, it turns out that we mainly use two equations to recover \(\beta_0\) and \(\beta_1\): \[\begin{eqnarray*}\mathbb{E}\left(\varepsilon_t\right) &=& 0 \\ \mathsf{Cov}\left(D_t,\varepsilon_t\right) &=& 0 \end{eqnarray*}\]

  • These equations may be rewritten as \[\begin{eqnarray*}\mathbb{E}\left(\varepsilon_t\right) &=& 0 \\ \mathbb{E}\left(D_t\varepsilon_t\right) &=& 0 \end{eqnarray*}\]

  • To emphasize the dependence on \(\beta_0\) and \(\beta_1\), we can write them as: \[\begin{eqnarray*}\mathbb{E}\left(C_t-\beta_0-\beta_1 I_t\right) &=& 0 \\ \mathbb{E}\left(D_t\left(C_t-\beta_0-\beta_1 I_t\right)\right) &=& 0 \end{eqnarray*}\]

  • We can further rewrite this in a more general form, by taking \(Z_t^\prime \leftarrow \left(1, D_t\right)\), \(X_t^\prime \leftarrow \left(1, I_t\right)\), \(Y_t \leftarrow C_t\), and \(\beta \leftarrow \left(\beta_0,\beta_1\right)^\prime\).

  • We have orthogonality conditions of the form \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\). This is the exogeneity of the instrument vector \(Z_t\) with respect to \(\varepsilon_t\).

  • We can also write \[\begin{eqnarray*}\mathbb{E}\left(Z_tX_t^\prime\right)\beta &=& \mathbb{E}\left(Z_tY_t\right) \\ \beta &=& \left(\mathbb{E}\left(Z_tX_t^\prime\right)\right)^{-1}\mathbb{E}\left(Z_tY_t\right)\end{eqnarray*}\] provided that \(\mathbb{E}\left(Z_tX_t^\prime\right)\) is invertible.

  • In the linear structural equation case, we reached a point where we can explore a more general case: \[\underbrace{\mathbb{E}\left(Z_tX_t^\prime\right)}_{\left(L\times K\right)}\underbrace{\beta}_{\left(K\times 1\right)}=\underbrace{\mathbb{E}\left(Z_tY_t\right)}_{\left(L\times 1\right)}.\]
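  • The sample analogue of \(\beta = \left(\mathbb{E}\left(Z_tX_t^\prime\right)\right)^{-1}\mathbb{E}\left(Z_tY_t\right)\) is easy to compute directly. Here is a sketch that re-simulates the consumption example with the same made-up parameter values; the \(1/n\) factors cancel, so sample cross-products suffice.
# Sketch: method-of-moments / IV estimator as the sample analogue of beta
set.seed(12345)
n <- 10^5
beta0 <- 1; beta1 <- 0.6
D <- rnorm(n, mean = 5, sd = 2); eps <- rnorm(n)
Inc <- (beta0 + D + eps)/(1 - beta1); Cons <- beta0 + beta1*Inc + eps
Z <- cbind(1, D)      # instrument vector Z_t' = (1, D_t)
X <- cbind(1, Inc)    # regressor vector X_t' = (1, I_t)
Y <- Cons
solve(crossprod(Z, X), crossprod(Z, Y))    # recovers (beta0, beta1)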

What happens if \(L>K\): building intuition

  • Typical instrumental variable applications have \(L=K\). But \(L>K\) is definitely possible.

  • The case where \(L=K\) and \(\mathbb{E}\left(Z_tX_t^\prime\right)\) is invertible is sometimes called the just-identified or exactly identified case.

  • The case where \(L<K\) is called the underidentified case.

  • If \(L>K\), then we have more moment conditions than the dimension of the parameters to be estimated. Intuitively, we should be able to exploit all \(L\) moment conditions.

  • The problem with \(L>K\) is that we can no longer invert \(\mathbb{E}\left(Z_tX_t^\prime\right)\), because it is no longer a square matrix.

  • We will replace the invertibility by a rank assumption, i.e., the rank of \(\mathbb{E}\left(Z_tX_t^\prime\right)\) should be equal to \(K\).

  • To gain insight on this and to determine what to do when \(L>K\), we should discuss what the reduced form is.

What happens if \(L>K\): reduced forms

  • We will be augmenting the linear structural equation by a reduced form.
  • A reduced form is a set of equations where you express the endogenous variables in terms of the exogenous variables. Reduced forms are linear regressions in the modern sense!
  • So, we have \[\begin{eqnarray} Y_t&=&X_t^\prime \beta+\varepsilon_t \\ X_t^\prime &=&Z_t^\prime \gamma+v_t^\prime. \end{eqnarray}\]
  • The first equation is assumed to be true, while the second equation is always true.
  • Note that \(\mathbb{E}\left(X_t\varepsilon_t\right)\neq 0\) and by construction, \(\mathbb{E}\left(Z_tv_t\right)=0\).
  • We still have \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\), as \(Z_t\) is exogenous with respect to \(\varepsilon_t\).
  • Substituting the reduced form into the structural equation, we will obtain \[\begin{eqnarray}Y_{t} &=& \left(Z_{t}^{\prime}\gamma+v_{t}^{\prime}\right)\beta+\varepsilon_{t} \\ Y_{t} &=& \left(Z_{t}^{\prime}\gamma\right)\beta+\left(v_{t}^{\prime}\beta+\varepsilon_{t}\right)\end{eqnarray}\]
  • Observe that the last equation is a linear regression. Check that \(\mathbb{E}\left[\left(Z_{t}^{\prime}\gamma\right)\left(v_{t}^{\prime}\beta+\varepsilon_{t}\right)\right]=0\).
  • We can write an expression for \(\beta\):\[\begin{eqnarray} \beta &=& \left\{ \mathbb{E}\left[\left(Z_{t}^{\prime}\gamma\right)^{\prime}\left(Z_{t}^{\prime}\gamma\right)\right]\right\} ^{-1}\mathbb{E}\left[\left(Z_{t}^{\prime}\gamma\right)^{\prime}Y_{t}\right]\\ &=& \left(\gamma^{\prime}\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\gamma\right)^{-1}\gamma^{\prime}\mathbb{E}\left(Z_{t}Y_{t}\right)\end{eqnarray}\]
  • But what is \(\gamma\)? It contains all BLP coefficients! It has the form \[\gamma = \left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right).\]

What happens if \(L>K\): two-stage least squares estimand

  • All together, we have the two-stage least squares (2SLS) estimand \[ \bigg(\underbrace{\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)}_{Q_{ZX}^{\prime}}(\underbrace{\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)}_{Q_{ZZ}})^{-1}\underbrace{\mathbb{E}\left(Z_{t}X_{t}^{\prime}\right)}_{Q_{ZX}}\bigg)^{-1}\mathbb{E}\left(X_{t}Z_{t}^{\prime}\right)\left(\mathbb{E}\left(Z_{t}Z_{t}^{\prime}\right)\right)^{-1}\mathbb{E}\left(Z_{t}Y_{t}\right)\]

  • This population quantity tells us that we can identify \(\beta\) using exogenous variation from \(Z\).

  • This means that a different \(Z\) will produce a different meaning for \(\beta\).

  • Provided that we have

    • Exogeneity: \(Z\) is exogenous with respect to \(\varepsilon\)
    • Relevance: \(Q_{ZX}\) has rank equal to \(K\)
    • Invertibility: \(Q_{ZZ}\) is nonsingular

we have uniquely identified \(\beta\) and expressed it in terms of observable quantities.

  • When exogeneity is violated, then this IV strategy fails.

  • When relevance is violated, either you do not have enough excluded instruments (instruments that do not appear in the structural form) or you have enough excluded instruments but they do not produce enough exogenous variation.

    • The first case is underidentification and the second case is called the problem of weak instruments.
  • Invertibility is less of an issue, but near-violations of it do create instability in the numerical results. I show you an example at the end of the slides.

What happens if \(L>K\): 2SLS estimator

  • The 2SLS estimator can be computed directly using the method of moments principle and it can be written as \[\widehat{\beta}_{2SLS}=\left(\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}\mathbf{Y}\]
  • Very similar to the setup for the OLS estimator, the 2SLS estimator matrix setup is also connected to a spreadsheet which looks like: \[\underset{\left(n\times1\right)}{\mathbf{Y}}=\left(\begin{array}{c} Y_{1}\\ \vdots\\ Y_{n} \end{array}\right),\underset{\left(n\times K\right)}{\mathbf{X}}=\left(\begin{array}{c} X_{1}^{\prime}\\ \vdots\\ X_{n}^{\prime} \end{array}\right), \underset{\left(n\times L\right)}{\mathbf{Z}}=\left(\begin{array}{c} Z_{1}^\prime\\ \vdots\\ Z_{n}^\prime \end{array}\right)\]
  • Why is it called 2SLS?

    • The 2SLS estimand already gives you a hint: that estimand uses best linear prediction twice.
    • The 2SLS estimator itself has some nice algebra to uncover. In particular, observe that \[\begin{eqnarray} \widehat{\beta}_{2SLS} &=& (\mathbf{X}^{\prime}\underbrace{\mathbf{Z}(\mathbf{Z}^{\prime}\mathbf{Z})^{-1}\mathbf{Z}^{\prime}}_{P_{\mathbf{Z}}}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{Z}\left(\mathbf{Z}^{\prime}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\prime}Y\\ & = & \left(\mathbf{X}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}P_{\mathbf{Z}}Y\\ & = &\left(\mathbf{X}^{\prime}P_{\mathbf{Z}}^{\prime}P_{\mathbf{Z}}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}P_{\mathbf{Z}}^{\prime}P_{\mathbf{Z}}Y\\ & = &\left[\left(P_{\mathbf{Z}}\mathbf{X}\right)^{\prime}P_{\mathbf{Z}}\mathbf{X}\right]^{-1}\left(P_{\mathbf{Z}}\mathbf{X}\right)^{\prime}P_{\mathbf{Z}}Y \end{eqnarray}\]
  • Thus, you compute the fitted values from a least squares fit of \(\mathbf{X}\) on \(\mathbf{Z}\). This gives you \(P_{\mathbf{Z}}\mathbf{X}\). The next stage is to compute the coefficient vector from the least squares fit of \(P_{\mathbf{Z}}Y\) on \(P_{\mathbf{Z}}\mathbf{X}\).
  • This two-stage interpretation is purely least squares algebra.
  • In practice, literally applying OLS in two stages is ONLY useful for getting the 2SLS point estimates: the standard errors from a literal second-stage OLS are wrong, so do NOT report them. (A worked sketch follows this list.)
  • But the “closer” \(\mathbf{X}\) is to \(P_{\mathbf{Z}}\mathbf{X}\), the more 2SLS will look like OLS! This means that the instrumental variables strategy will fail!
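  • Here is a worked sketch of this algebra on simulated overidentified data (a hypothetical DGP with \(L=3\) and \(K=2\)): the matrix formula and the literal two-stage computation give the same point estimates.
# Sketch: 2SLS on simulated overidentified data (hypothetical DGP)
set.seed(12345)
n <- 10^4
Z1 <- rnorm(n); Z2 <- rnorm(n)
u <- rnorm(n)                          # structural error
x <- Z1 + 0.5*Z2 + 0.8*u + rnorm(n)    # endogenous regressor (correlated with u)
y <- 1 + 2*x + u                       # structural equation with beta = (1, 2)
X <- cbind(1, x); Z <- cbind(1, Z1, Z2)
# 2SLS via the matrix formula: first form P_Z X, the first-stage fitted values
PZX <- Z %*% solve(crossprod(Z), crossprod(Z, X))
beta.2sls <- solve(crossprod(PZX, X), crossprod(PZX, y))
# Literal two stages: same point estimates (but second-stage SEs would be wrong)
xhat <- fitted(lm(x ~ Z1 + Z2))
beta.twostage <- coef(lm(y ~ xhat))
cbind(beta.2sls, beta.twostage)        # both close to (1, 2)
# coef(AER::ivreg(y ~ x | Z1 + Z2)) should give the same point estimates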

Generalizing the method of moments

  • It turns out that the 2SLS estimator is a special case of a more general class of estimators called generalized method of moments (GMM) estimators.

  • It can be shown that GMM applied to linear structural equations with orthogonality conditions of the form \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\) is given by \[\widehat{\beta}_{GMM}=\left(\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Z}\widehat{W}^{-1}\mathbf{Z}^{\prime}\mathbf{Y}\]

  • \(\widehat{W}\) is called a weighting matrix. It “weighs” the moment conditions in a particular way.

  • It is possible to choose this weighting matrix in an optimal way. This means that we can choose \(\widehat{W}\) so that in the class of GMM estimators satisfying \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\), we can find the estimator with the lowest variance.

  • This generalization of the method of moments was developed in Hansen (1982) and has primarily been applied in macroeconomics.

  • We do not pursue these ideas further here, but note that choosing \(\widehat{W}\) as a constant proportion of \(\mathbf{Z}^{\prime}\mathbf{Z}\) is actually optimal under IID sampling and conditional homoscedasticity.

  • Curiously, practitioners still use 2SLS even if it is not efficient under conditional heteroscedasticity.

How do we test whether the exogeneity holds?

  • We want to test the validity of ALL moment conditions \(\mathbb{E}\left(Z_t\varepsilon_t\right)=0\). This would be the null hypothesis.

  • The alternative hypothesis is that one or more of these \(L\) moment conditions is violated.

  • We are only able to test the model when \(L>K\). What happens when \(L<K\)? \(L=K\)?

  • The idea is to check how far \[\frac{1}{n}\sum_{t=1}^n Z_t\varepsilon_t\] is from zero (Why zero?).

  • It turns out that for the linear structural equation case with conditional homoscedasticity, a convenient procedure is available.

  • Run a regression of the 2SLS residuals on all the instruments.

  • Calculate the R-squared from that regression and compute \(J=n\times R^2\). This is called a Sargan statistic. (A worked sketch appears at the end of this slide.)

  • Reject the null at the \(100\alpha\%\) level when \(J\) is greater than the critical value \(c\) of the \(\chi^2_{L-K}\) distribution, i.e., the \(c\) satisfying \(\mathbb{P}\left(\chi^2_{L-K} > c\right)=\alpha\).

  • Of course, you can also report a right-tailed \(p\)-value instead.

  • We also sometimes call it a test of overidentifying restrictions or overidentifying restrictions test or \(J\)-test (Hansen 1982).

  • If you reject the null, the test does not tell you which moment condition is incompatible with the data.
  • If you do not reject the null, it does not necessarily mean that exogeneity holds.
  • There are developments that look at subset exogeneity tests, but they require that you already know that some instruments are exogenous in the first place.
  • In the GMM context, the test of exogeneity is really a test of model specification.
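  • Here is a worked sketch of the \(n\times R^2\) recipe on simulated data (a hypothetical DGP with \(L=3\) instruments and \(K=2\) parameters, so \(L-K=1\) overidentifying restriction).
# Sketch: Sargan statistic as n times R-squared from regressing 2SLS residuals
# on all the instruments (hypothetical DGP)
library(AER)
set.seed(12345)
n <- 10^4
Z1 <- rnorm(n); Z2 <- rnorm(n); u <- rnorm(n)
x <- Z1 + 0.5*Z2 + 0.8*u + rnorm(n)
y <- 1 + 2*x + u
tsls <- ivreg(y ~ x | Z1 + Z2)
res <- residuals(tsls)
J <- n * summary(lm(res ~ Z1 + Z2))$r.squared
# Statistic, right-tailed p-value, 5% critical value from chi-squared with 1 df
c(J, 1 - pchisq(J, df = 1), qchisq(0.95, df = 1))
# This should essentially match the Sargan row from summary(tsls, diagnostics = TRUE)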

Angrist and Krueger (1991)

  • The idea is to estimate the returns to schooling by running a regression of log wages on years of schooling.
  • The problem is that years of schooling is not exogenous with respect to unobserved determinants of wages.
  • When you read an IV/2SLS paper, you need a proof of concept, i.e., what is the source of exogenous variation? Figures I, II, and III provide this proof of concept.
  • They estimate the following structural and reduced forms from page 997: \[\begin{eqnarray*}\ln W_i &=& X_i\beta+\sum_c Y_{ic}\xi_c+\rho E_i+\mu_i \\ E_i &=& X_i \pi+\sum_c Y_{ic}\delta_c+\sum_c\sum_j Y_{ic}Q_{ij}\theta_{jc}+\epsilon_i \end{eqnarray*}\]

  • If \(\mu_i\) is correlated (for whatever reasons) with \(E_i\), then applying OLS will produce an inconsistent estimator for \(\rho\).

  • The first equation is the wage equation treated as structural and the second equation is the reduced form for the suspected endogenous regressor \(E_i\).

  • This paper is a classic because it popularized the use of instrumental variables as a strategy to recover causal effects.

  • Instrumental variables actually have been around since the 1920s. Read this retrospective to get a sense of the discovery!

  • This paper also produced a lot of theoretical research (still ongoing!) on statistical problems with the instrumental variables strategy.

    • How do you deal with weak instruments?
    • What if there is conditional heteroscedasticity?
    • What about the nonlinearity of the structural form? Check this for a recent revisit.
    • Do these three issues affect each other?

AK91: Loading data

library(foreign)
NEW7080 <- read.dta("NEW7080.dta")
# Renaming variables
names(NEW7080) <- c("AGE", "AGEQ", "v3", "EDUC", 
  "ENOCENT", "ESOCENT", "v7", "v8", "LWKLYWGE", "MARRIED", 
  "MIDATL", "MT", "NEWENG", "v14", "v15", "CENSUS", "v17", 
  "QOB", "RACE", "SMSA", "SOATL", "v22", "v23", 
  "WNOCENT", "WSOCENT", "v26", "YOB")
ak2029 <- subset(NEW7080, NEW7080$YOB>=1920 & NEW7080$YOB<=1929)
ak2029$AGEQSQ <- ak2029$AGEQ^2

AK91: Applying OLS using lm()

ols2029.1 <- lm(LWKLYWGE~EDUC+factor(YOB), data=ak2029)
ols2029.2 <- lm(LWKLYWGE~EDUC+AGEQ+AGEQSQ+factor(YOB), 
                data=ak2029)
ols2029.3 <- lm(LWKLYWGE~EDUC+RACE+MARRIED+SMSA
                +NEWENG+MIDATL+ENOCENT+WNOCENT
                +SOATL+ESOCENT+WSOCENT+MT
                +factor(YOB), data=ak2029)
ols2029.4 <- lm(LWKLYWGE~EDUC+RACE+MARRIED+SMSA
                +NEWENG+MIDATL+ENOCENT+WNOCENT
                +SOATL+ESOCENT+WSOCENT+MT
                +AGEQ+AGEQSQ+factor(YOB), data=ak2029)

AK91: Applying 2SLS using ivreg()

library(AER)
iv2029.1 <- ivreg(LWKLYWGE~EDUC+factor(YOB) 
                  | factor(YOB)/factor(QOB), data=ak2029)
iv2029.2 <- ivreg(LWKLYWGE~EDUC+AGEQ+AGEQSQ
                  +factor(YOB) | factor(YOB)/factor(QOB)+AGEQ+AGEQSQ, data=ak2029)
iv2029.3 <- ivreg(LWKLYWGE~EDUC+RACE+MARRIED
                  +SMSA+NEWENG+MIDATL+ENOCENT
                  +WNOCENT+SOATL+ESOCENT+WSOCENT
                  +MT+factor(YOB) | factor(YOB)/factor(QOB)
                  +RACE+MARRIED+SMSA+NEWENG+MIDATL
                  +ENOCENT+WNOCENT+SOATL+ESOCENT
                  +WSOCENT+MT, data=ak2029)
iv2029.4 <- ivreg(LWKLYWGE~EDUC+RACE+MARRIED
                  +SMSA+NEWENG+MIDATL+ENOCENT
                  +WNOCENT+SOATL+ESOCENT+WSOCENT
                  +MT+AGEQ+AGEQSQ+factor(YOB) 
                  | factor(YOB)/factor(QOB)
                  +RACE+MARRIED+SMSA+NEWENG+MIDATL
                  +ENOCENT+WNOCENT+SOATL+ESOCENT
                  +WSOCENT+MT+AGEQ+AGEQSQ, data=ak2029)

AK91: IV/2SLS diagnostics

summary(iv2029.1, diagnostics=TRUE)[[12]]
                 df1    df2 statistic  p-value
Weak instruments  30 247159    4.5985 8.84e-16
Wu-Hausman         1 247187    0.0483 8.26e-01
Sargan            29     NA   36.0226 1.73e-01
summary(iv2029.2, diagnostics=TRUE)[[12]]
                 df1    df2 statistic p-value
Weak instruments  28 247159      1.08   0.346
Wu-Hausman         1 247185      2.52   0.112
Sargan            29     NA     25.59   0.647
summary(iv2029.3, diagnostics=TRUE)[[12]]
                 df1    df2 statistic  p-value
Weak instruments  30 247148    4.5534 1.52e-15
Wu-Hausman         1 247176    0.0469 8.28e-01
Sargan            29     NA   34.1648 2.33e-01
summary(iv2029.4, diagnostics=TRUE)[[12]]
                 df1    df2 statistic p-value
Weak instruments  28 247148     1.025   0.428
Wu-Hausman         1 247174     0.864   0.353
Sargan            29     NA    28.809   0.475

AK91: tables

models <- list()
models[['(1) OLS']] <- ols2029.1
models[['(2) TSLS']] <- iv2029.1
models[['(3) OLS']] <- ols2029.2
models[['(4) TSLS']] <- iv2029.2
models[['(5) OLS']] <- ols2029.3
models[['(6) TSLS']] <- iv2029.3
models[['(7) OLS']] <- ols2029.4
models[['(8) TSLS']] <- iv2029.4
diag1 <- summary(iv2029.1, diagnostics = TRUE)[[12]]
diag2 <- summary(iv2029.2, diagnostics = TRUE)[[12]]
diag3 <- summary(iv2029.3, diagnostics = TRUE)[[12]]
diag4 <- summary(iv2029.4, diagnostics = TRUE)[[12]]
library(tibble)
rows <- tribble(~label, ~OLS, ~TSLS, ~OLS, ~TSLS, ~OLS, ~TSLS, ~OLS, ~TSLS, 
                'Residence dummies', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes',
                'Year of birth dummies', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
                'J-test [dof]', '',
                paste(format(diag1["Sargan","statistic"], digits=3), "[", diag1["Sargan","df1"], "]"), '',
                paste(format(diag2["Sargan","statistic"], digits=3), "[", diag2["Sargan","df1"], "]"), '',
                paste(format(diag3["Sargan","statistic"], digits=3), "[", diag3["Sargan","df1"], "]"), '',
                paste(format(diag4["Sargan","statistic"], digits=3), "[", diag4["Sargan","df1"], "]"))
attr(rows, 'position') <- c(14, 15, 16)
library(modelsummary)
table4 <- modelsummary(models, coef_map = c("EDUC", "RACE", "SMSA", "MARRIED", "AGEQ", "AGEQSQ"), add_rows = rows, gof_map = c("nobs"), output = "huxtable")
library(huxtable)
set_text_color(table4, col = c(3, 5, 7, 9), value = 'red')
                        (1) OLS   (2) TSLS   (3) OLS   (4) TSLS   (5) OLS   (6) TSLS   (7) OLS   (8) TSLS
EDUC                     0.080     0.077      0.080     0.131      0.070     0.067      0.070     0.101
                        (0.0004)  (0.015)    (0.0004)  (0.033)    (0.0004)  (0.015)    (0.0004)  (0.033)
RACE                                                               -0.298    -0.306     -0.298    -0.227
                                                                   (0.004)   (0.035)    (0.004)   (0.078)
SMSA                                                               -0.134    -0.136     -0.134    -0.116
                                                                   (0.003)   (0.009)    (0.003)   (0.020)
MARRIED                                                             0.293     0.294      0.293     0.280
                                                                   (0.004)   (0.007)    (0.004)   (0.014)
AGEQ                                           0.145     0.141                            0.116     0.117
                                              (0.068)   (0.070)                          (0.065)   (0.066)
AGEQSQ                                        -0.002    -0.001                           -0.001    -0.001
                                              (0.0007)  (0.0008)                         (0.0007)  (0.0007)
Residence dummies          No        No         No        No         Yes       Yes        Yes       Yes
Year of birth dummies      Yes       Yes        Yes       Yes        Yes       Yes        Yes       Yes
J-test [dof]                       36 [ 29 ]            25.6 [ 29 ]           34.2 [ 29 ]           28.8 [ 29 ]
Num.Obs.                247199    247199     247199    247199     247199    247199     247199    247199

AK91: Applying 2SLS using feols(), potential problems

library(fixest)
iv2029.1 <- feols(LWKLYWGE ~ i(YOB)
                  | EDUC ~ i(YOB, i.QOB), data=ak2029)
iv2029.2 <- feols(LWKLYWGE ~ i(YOB) + AGEQ + AGEQSQ
                  | EDUC ~ i(YOB, i.QOB), data=ak2029)
iv2029.3 <- feols(LWKLYWGE ~ RACE + MARRIED
                  + SMSA + NEWENG + MIDATL + ENOCENT
                  + WNOCENT + SOATL + ESOCENT + WSOCENT
                  + MT + i(YOB) | EDUC ~ i(YOB, i.QOB), data=ak2029)
iv2029.4 <- feols(LWKLYWGE ~ RACE + MARRIED
                  + SMSA + NEWENG + MIDATL + ENOCENT
                  + WNOCENT + SOATL + ESOCENT + WSOCENT
                  + MT + AGEQ + AGEQSQ + i(YOB)
                  | EDUC ~ i(YOB, i.QOB), data=ak2029)

AK91: Tables using fixest package

# Rename variables to something more readable for a report
setFixest_dict(c(EDUC = "Years of schooling", AGEQ = "Age",
                 AGEQSQ = "Age squared"))
# Report other statistics at bottom of table
fitstat_register("sargan.df", function(x) fitstat(x, type = "sargan")[[1]]$df, "Sargan df")
# Built-in table generator from fixest
etable(iv2029.1, iv2029.2, iv2029.3, iv2029.4, 
       keep = c("Years of schooling", "RACE", "SMSA", "MARRIED", "Age squared", "Age"),
       se.below = TRUE, 
       signif.code = NA, 
       depvar = FALSE, 
       se.row = FALSE, 
       fitstat = c("sargan", "sargan.df"),
       order = c("Years of schooling", "RACE", "SMSA", "MARRIED", "Age", "Age squared"),
       group = list("Residence dummies" = c("NEWENG", "MIDATL", "ENOCENT", "WNOCENT", "SOATL", "ESOCENT", "WSOCENT", "MT")),
       extralines = list("Year of birth dummies" = rep("Yes", 4)))
                          iv2029.1   iv2029.2   iv2029.3   iv2029.4
Years of schooling         0.0769     9.7e-5     0.0669    -8.71e-6
                          (0.0150)  (6.96e-5)   (0.0151)  (6.92e-5)
RACE                                            -0.3055    -0.4605
                                                (0.0353)   (0.0046)
SMSA                                            -0.1362    -0.1756
                                                (0.0092)   (0.0028)
MARRIED                                          0.2941     0.3213
                                                (0.0072)   (0.0040)
Age squared                          -0.0023               -0.0014
                                     (0.0009)              (0.0008)
Age                                   0.1927                0.1122
                                     (0.0802)              (0.0722)
Residence dummies            No         No         Yes        Yes
Year of birth dummies        Yes        Yes        Yes        Yes
____________________________________________________________________
Sargan                     36.023    -17.026     34.165    -214.58
Sargan df                    38         38         38         38

AK91: Applying OLS using feols()

# Rescaling age (you will see why later)
ak2029$AGEQs <- ak2029$AGEQ/10
ak2029$AGEQSQs <- (ak2029$AGEQ/10)^2
ols2029.1 <- feols(LWKLYWGE~EDUC+i(YOB), data=ak2029)
ols2029.2 <- feols(LWKLYWGE~EDUC+AGEQs+AGEQSQs+i(YOB), 
                data=ak2029)
ols2029.3 <- feols(LWKLYWGE~EDUC+RACE+MARRIED+SMSA
                +NEWENG+MIDATL+ENOCENT+WNOCENT
                +SOATL+ESOCENT+WSOCENT+MT
                +i(YOB), data=ak2029)
ols2029.4 <- feols(LWKLYWGE~EDUC+RACE+MARRIED+SMSA
                +NEWENG+MIDATL+ENOCENT+WNOCENT
                +SOATL+ESOCENT+WSOCENT+MT
                +AGEQs+AGEQSQs+i(YOB), data=ak2029)

AK91: Applying 2SLS using feols()

iv2029.1 <- feols(LWKLYWGE ~ i(YOB)
                  | EDUC ~ i(YOB, i.QOB), data=ak2029)
iv2029.2 <- feols(LWKLYWGE ~ i(YOB) + AGEQs + AGEQSQs
                  | EDUC ~ i(YOB, i.QOB), data=ak2029)
iv2029.3 <- feols(LWKLYWGE ~ RACE + MARRIED
                  + SMSA + NEWENG + MIDATL + ENOCENT
                  + WNOCENT + SOATL + ESOCENT + WSOCENT
                  + MT + i(YOB) | EDUC ~ i(YOB, i.QOB), data=ak2029)
iv2029.4 <- feols(LWKLYWGE ~ RACE + MARRIED
                  + SMSA + NEWENG + MIDATL + ENOCENT
                  + WNOCENT + SOATL + ESOCENT + WSOCENT
                  + MT + AGEQs + AGEQSQs + i(YOB)
                  | EDUC ~ i(YOB, i.QOB), data=ak2029)
  • Some diagnostics typically reported with IV/2SLS:
fitstat(iv2029.4, type = c("ivwald1", "wh", "sargan"))
Wald (1st stage), Years of schooling: stat =  3.25657 , p = 3.015e-11, on 39 and 247,147 DoF, VCOV: IID.
                          Wu-Hausman: stat =  0.734249, p = 0.39151  , on 1 and 247,174 DoF.
                              Sargan: stat = 29.0     , p = 0.854253 , on 38 DoF.
# Rename again
setFixest_dict(c(EDUC = "Years of schooling", AGEQs = "Age/10",
                 AGEQSQs = "Age squared/100"))
# Built-in table generator from fixest
etable(ols2029.1, iv2029.1, ols2029.2, iv2029.2, ols2029.3, iv2029.3, ols2029.4, iv2029.4, 
       keep = c("Years of schooling", "RACE", "SMSA", "MARRIED", "Age squared/100", "Age/10"),
       se.below = TRUE, 
       signif.code = NA, 
       depvar = FALSE, 
       se.row = FALSE, 
       fitstat = c("sargan", "sargan.df"),
       order = c("Years of schooling", "RACE", "SMSA", "MARRIED", "Age/10", "Age squared/100"),
       group = list("Residence dummies" = c("NEWENG", "MIDATL", "ENOCENT", "WNOCENT", "SOATL", "ESOCENT", "WSOCENT", "MT")),
       extralines = list("Year of birth dummies" = rep("Yes", 4)))
                        ols20..1  iv2029.1  ols20..2  iv2029.2  ols20..3  iv2029.3  ols20..4  iv2029.4
Years of schooling       0.0802    0.0769    0.0802    0.1319    0.0701    0.0669    0.0701    0.0986
                        (0.0004)  (0.0150)  (0.0004)  (0.0335)  (0.0004)  (0.0151)  (0.0004)  (0.0335)
RACE                                                            -0.2980   -0.3055   -0.2980   -0.2321
                                                                (0.0043)  (0.0353)  (0.0043)  (0.0777)
SMSA                                                            -0.1343   -0.1362   -0.1343   -0.1176
                                                                (0.0026)  (0.0092)  (0.0026)  (0.0199)
MARRIED                                                          0.2928    0.2941    0.2928    0.2812
                                                                (0.0037)  (0.0072)  (0.0037)  (0.0141)
Age/10                                        1.446     1.411                        1.162     1.203
                                             (0.6760)  (0.7048)                     (0.6517)  (0.6604)
Age squared/100                              -0.1542   -0.1357                      -0.1250   -0.1217
                                             (0.0748)  (0.0788)                     (0.0721)  (0.0733)
Residence dummies          No        No        No        No        Yes       Yes       Yes       Yes
Year of birth dummies      Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
_____________________________________________________________________________________________________
Sargan                      --     36.023      --      25.598      --     34.165      --      28.959
Sargan df                   --       38        --        38        --       38        --        38

Problems with feols()

  • There can be numerical problems when there are many dummy variables, even with a very large sample size!
  • This is the situation to be concerned about if you are worried about multicollinearity. Here it was fixed by rescaling the age variables.
  • feols() is relatively new, so be careful when relying on readily available packages. Convenience does have its price.
  • Still, the degrees of freedom of the Sargan statistic reported by feols() are incorrect.

Notes on generating tables

  • It is hard to teach the generation of tables, but the code here can be a starting point for you.
  • You have to read the help files or understand the components of the examples so that you can trace what is happening.
  • It might be more important to set aside a lot of time to figure these out, and to design the table you have in mind first, before tinkering with the commands.
  • Always check what the tables would look like!