class: center, middle, inverse, title-slide

.title[
# Introduction to Instrumental Variables
]
.subtitle[
##
]
.author[
### Ian McCarthy | Emory University
]
.date[
### Econ 771, Fall 2022
]

---
class: inverse, center, middle

<!-- Adjust some CSS code for font size and maintain R code font size -->
<style type="text/css">
.remark-slide-content {
  font-size: 30px;
  padding: 1em 2em 1em 2em;
}
.remark-code {
  font-size: 15px;
}
.remark-inline-code {
  font-size: 20px;
}
</style>

<!-- Set R options for how code chunks are displayed and load packages -->

# Assessing Selection on Unobservables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html>

---
class: clear

- Say we estimate a regression like
`$$y_{i} = \delta D_{i} + \beta_{1} x_{1i} + \varepsilon_{i}$$`
- But we are concerned that the "true" specification is
`$$y_{i} = \delta D_{i} + \beta_{1} x_{1i} + \beta_{2} x_{2i} + \varepsilon_{i}$$`
- **Idea:** Extending the work of Altonji and others, Oster (2019) aims to decompose the outcome into a treatment effect ($\delta$), observed controls ($x_{1i}$), unobserved controls ($x_{2i}$), and an iid error

---
# Oster (2019), JBES

- **Key assumption:** Selection on observables is informative about selection on unobservables

1. What is the maximum `\(R^2\)` value we could obtain if we observed `\(x_{2}\)`? Call this `\(R_{\text{max}}^{2}\)` (naturally bounded above by 1, but likely smaller)

2. What is the degree of selection on observed variables relative to unobserved variables? Denote the proportional relationship as `\(\rho\)` such that:
`$$\rho \times \frac{Cov(x_{1},D)}{Var(x_{1})} = \frac{Cov(x_{2},D)}{Var(x_{2})}.$$`

---
# Oster (2019), JBES

- Under an "equal relative contributions" assumption, we can write:
`$$\delta^{*} \approx \hat{\delta}_{D,x_{1}} - \rho \times \left[\hat{\delta}_{D} - \hat{\delta}_{D,x_{1}}\right] \times \frac{R_{\text{max}}^{2} - R_{D,x_{1}}^{2}}{R_{D,x_{1}}^{2} - R_{x_{1}}^{2}} \xrightarrow{p} \delta.$$`
- Consider a range of `\(R^{2}_{\text{max}}\)` and `\(\rho\)` to bound the estimated treatment effect,
`$$\left[ \hat{\delta}_{D,x_{1}}, \delta^{*}(\bar{R}^{2}_{max}, \rho) \right]$$`
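---
# Oster (2019) in practice

A minimal sketch of the bound from the previous slide, assuming a hypothetical data frame `df` with outcome `y`, treatment `d`, and observed control `x1`; the values of `rho` and `r2.max` below are illustrative choices (the 1.3 multiplier is a commonly used heuristic), not estimates.

```r
short <- lm(y ~ d, data = df)        # regression with treatment only
ctrl <- lm(y ~ d + x1, data = df)    # regression with observed control

delta.d <- coef(short)[["d"]]                       # delta-hat_D
delta.dx1 <- coef(ctrl)[["d"]]                      # delta-hat_{D,x1}
r2.dx1 <- summary(ctrl)$r.squared                   # R^2_{D,x1}
r2.x1 <- summary(lm(y ~ x1, data = df))$r.squared   # R^2_{x1}

rho <- 1                        # equal selection on observables/unobservables
r2.max <- min(1.3*r2.dx1, 1)    # chosen upper bound on R^2

delta.star <- delta.dx1 - rho*(delta.d - delta.dx1)*
  (r2.max - r2.dx1)/(r2.dx1 - r2.x1)
c(delta.dx1, delta.star)        # bounding set for the treatment effect
```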
---
# Augmented regression (somewhat out of place here)

- Oster (2019) and similar papers can say something about how bad selection on unobservables would need to be
- But what kind of "improvement" do we really get in practice?

---
class: clear

- Original test from Hausman (1978) not specific to endogeneity, just a general misspecification test
- Compare estimates from one estimator (efficient under the null) to another estimator that is consistent but inefficient under the null
- In the IV context, also known as the Durbin-Wu-Hausman test, due to the series of papers pre-dating Hausman (1978), including Durbin (1954) and Wu (1973)

---
class: clear

- Easily implemented as an "artificial" or "augmented" regression
- We want to estimate `\(y=\beta_{1}x_{1} + \beta_{2}x_{2} + \varepsilon\)`, with exogenous variables `\(x_{1}\)`, endogenous variables `\(x_{2}\)`, and instruments `\(z\)`

1. Regress each of the variables in `\(x_{2}\)` on `\(x_{1}\)` and `\(z\)`, `\(x_{2} = \lambda_{x} x_{1} + \lambda_{z} z + v\)`, and form the residuals `\(\hat{v}\)`
2. Include `\(\hat{v}\)` in the standard OLS regression of `\(y\)` on `\(x_{1}\)`, `\(x_{2}\)`, and `\(\hat{v}\)`.
3. Test `\(H_{0}: \beta_{\hat{v}} = 0\)`. Rejection implies OLS is inconsistent.

--

<br>
Intuition: The only way for `\(x_{2}\)` to be correlated with `\(\varepsilon\)` is through `\(v\)`, **assuming `\(z\)` is a "good" instrument**

---
class: clear

- Do we have an endogeneity problem?
  - Effects easily overcome by small selection on unobservables?
  - Clear reverse causality problem?
- What can we do about it?
  - Matching, weighting, regression? Only for selection on observables
  - DD, RD, differences in discontinuities? Specific designs and settings
  - Instrumental variables?

---
class: inverse, center, middle

# Instrumental Variables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html>

---
# What are instrumental variables?

Instrumental variables (IV) is a way to identify causal effects using variation in treatment participation that is due to an *exogenous* variable that is only related to the outcome through treatment.

<img src="03-1_files/figure-html/tikz-timeline-1.png" style="display: block; margin: auto;" />

---
# Simple example

- `\(y = \beta x + \varepsilon(x)\)`,<br>
where `\(\varepsilon(x)\)` reflects the dependence between our observed variable and the error term.<br>
- Simple OLS will yield<br>
`\(\frac{dy}{dx} = \beta + \frac{d\varepsilon}{dx} \neq \beta\)`

---
# What does IV do?

- The regression we want to do: <br>
`\(y_{i} = \alpha + \delta D_{i} + \gamma A_{i} + \epsilon_{i}\)`,<br>
where `\(D_{i}\)` is treatment (think of schooling for now) and `\(A_{i}\)` is something like ability.
- `\(A_{i}\)` is unobserved, so instead we run: <br>
`\(y_{i} = \alpha + \beta D_{i} + \epsilon_{i}\)`
- From this "short" regression, we don't actually estimate `\(\delta\)`. Instead, we get an estimate of<br>
`\(\beta = \delta + \lambda_{ds}\gamma \neq \delta\)`,<br>
where `\(\lambda_{ds}\)` is the coefficient from a regression of `\(A_{i}\)` on `\(D_{i}\)`.

---
# Intuition

IV will recover the "long" regression without observing underlying ability<br>

--

<br>
*IF* our IV satisfies all of the necessary assumptions.

---
# More formally

- We want to estimate<br>
`\(E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0]\)`
- With an instrument `\(Z_{i}\)` that satisfies the relevant assumptions, we can estimate this as<br>
`\(E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}\)`
- In words, this is the effect of the instrument on the outcome ("reduced form") divided by the effect of the instrument on treatment ("first stage")

---
# Derivation

Recall the "long" regression: `\(Y=\alpha + \delta S + \gamma A + \epsilon\)`.

`$$\begin{align}
COV(Y,Z) & = E[YZ] - E[Y] E[Z] \\
 & = E[(\alpha + \delta S + \gamma A + \epsilon)\times Z] - E[\alpha + \delta S + \gamma A + \epsilon]E[Z] \\
 & = \alpha E[Z] + \delta E[SZ] + \gamma E[AZ] + E[\epsilon Z] \\
 & \hspace{.2in} - \alpha E[Z] - \delta E[S]E[Z] - \gamma E[A] E[Z] - E[\epsilon]E[Z] \\
 & = \delta (E[SZ] - E[S] E[Z]) + \gamma (E[AZ] - E[A] E[Z]) \\
 & \hspace{.2in} + E[\epsilon Z] - E[\epsilon] E[Z] \\
 & = \delta C(S,Z) + \gamma C(A,Z) + C(\epsilon, Z)
\end{align}$$`

---
# Derivation

Working from `\(COV(Y,Z) = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon,Z)\)`, we find

`$$\delta = \frac{COV(Y,Z)}{COV(S,Z)}$$`

if `\(COV(A,Z)=COV(\epsilon, Z)=0\)`

---
# IVs in practice

Easy to think of in terms of a randomized controlled trial...

--

<br>

Measure    | Offered Seat | Not Offered Seat | Difference
---------- | ------------ | ---------------- | ----------
Score      | -0.003       | -0.358           | 0.355
% Enrolled | 0.787        | 0.046            | 0.741
Effect     |              |                  | 0.48

<br>

.footnote[
Angrist *et al.*, 2012. "Who Benefits from KIPP?" *Journal of Policy Analysis and Management*.
]
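---
# IVs in practice: the Wald ratio

The "Effect" row on the previous slide is just the reduced form divided by the first stage. A quick check with the numbers from the table, which matches the 0.48 up to rounding:

```r
# Wald estimate from the KIPP lottery: reduced form / first stage
reduced.form <- -0.003 - (-0.358)   # offer effect on test scores
first.stage <- 0.787 - 0.046        # offer effect on enrollment
reduced.form/first.stage            # approximately 0.48
```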
---
# What is IV *really* doing

Think of IV as two steps:

1. Isolate variation due to the instrument only (not due to endogenous stuff)
2. Estimate the effect on the outcome using only this source of variation

---
# In regression terms

Interested in estimating `\(\delta\)` from `\(y_{i} = \alpha + \beta x_{i} + \delta D_{i} + \varepsilon_{i}\)`, but `\(D_{i}\)` is endogenous (no pure "selection on observables").

--

<br>
<b>Step 1:</b> With instrument `\(Z_{i}\)`, we can regress `\(D_{i}\)` on `\(Z_{i}\)` and `\(x_{i}\)`,<br>
`\(D_{i} = \lambda + \theta Z_{i} + \kappa x_{i} + \nu\)`,<br>
and form the prediction `\(\hat{D}_{i}\)`.

--

<br>
<b>Step 2:</b> Regress `\(y_{i}\)` on `\(x_{i}\)` and `\(\hat{D}_{i}\)`,<br>
`\(y_{i} = \alpha + \beta x_{i} + \delta \hat{D}_{i} + \xi_{i}\)`

---
# Derivation

Recall `\(\hat{\theta}=\frac{C(Z,S)}{V(Z)}\)`, or `\(\hat{\theta}V(Z) = C(S,Z)\)`. Then:

`$$\begin{align}
\hat{\delta} & = \frac{COV(Y,Z)}{COV(S,Z)} \\
 & = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}C(S,Z)} = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}^{2}V(Z)} \\
 & = \frac{C(\hat{\theta}Z,Y)}{V(\hat{\theta}Z)} = \frac{C(\hat{S},Y)}{V(\hat{S})}
\end{align}$$`

---
# Animation for IV

.center[
![](pics/iv_animate.gif)
]

---
# Simulated data

.pull-left[

```r
n <- 5000
b.true <- 5.25
iv.dat <- tibble(
  z = rnorm(n,0,2),
  eps = rnorm(n,0,1),
  d = (z + 1.5*eps + rnorm(n,0,1) > 0.25),
  y = 2.5 + b.true*d + eps + rnorm(n,0,0.5)
)
```
]

.pull-right[
- endogenous `eps`: affects treatment and outcome
- `z` is an instrument: affects treatment but has no direct effect on the outcome
]

---
# Results with simulated data

Recall that the *true* treatment effect is 5.25

.pull-left[

```
## 
## Call:
## lm(formula = y ~ d, data = iv.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5361 -0.6813  0.0009  0.6888  3.5645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.07914    0.01980   105.0   <2e-16 ***
## dTRUE        6.16356    0.02891   213.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 4998 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.9009 
## F-statistic: 4.546e+04 on 1 and 4998 DF,  p-value: < 2.2e-16
```
]

.pull-right[

```
## TSLS estimation, Dep. Var.: y, Endo.: d, Instr.: z
## Second stage: Dep. Var.: y
## Observations: 5,000
## Standard-errors: IID 
##             Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)  2.46865   0.028767  85.8167 < 2.2e-16 ***
## fit_dTRUE    5.33305   0.051572 103.4095 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 1.10089   Adj. R2: 0.884565
## F-test (1st stage), dTRUE: stat = 2,886.4, p < 2.2e-16, on 1 and 4,998 DoF.
##                Wu-Hausman: stat =   526.8, p < 2.2e-16, on 1 and 4,997 DoF.
```
]

---
# Two-stage equivalence

```r
step1 <- lm(d ~ z, data=iv.dat)
d.hat <- predict(step1)
step2 <- lm(y ~ d.hat, data=iv.dat)
summary(step2)
```

```
## 
## Call:
## lm(formula = y ~ d.hat, data = iv.dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.937 -2.174 -0.022  2.134  9.212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46865    0.07348   33.59   <2e-16 ***
## d.hat        5.33305    0.13174   40.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.813 on 4998 degrees of freedom
## Multiple R-squared:  0.2469, Adjusted R-squared:  0.2468 
## F-statistic:  1639 on 1 and 4998 DF,  p-value: < 2.2e-16
```
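---
# Manual 2SLS vs. packaged IV

The manual two-step procedure reproduces the TSLS point estimate (5.33305), but its second-stage standard errors are not correct because they treat `d.hat` as data rather than as an estimate. In practice, use a packaged IV estimator. The call behind the TSLS output above is not shown on that slide; a sketch using `fixest` (which the output format suggests) would be:

```r
library(fixest)
# fixest IV syntax: outcome ~ exogenous vars | endogenous ~ instruments
iv.est <- feols(y ~ 1 | d ~ z, data = iv.dat)
summary(iv.est)
```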
---
class: inverse, center, middle

# Assumptions of Instrumental Variables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html>

---
# Key IV assumptions

1. *Exclusion:* Instrument is uncorrelated with the error term<br>
2. *Validity:* Instrument is correlated with the endogenous variable<br>
3. *Monotonicity:* Treatment more (less) likely for those with higher (lower) values of the instrument<br>

--

<br>
Assumptions 1 and 2 are sometimes grouped into an *only through* condition.

---
# Checking instrument

.pull-left[
- Check the 'first stage'

```
## 
## Call:
## lm(formula = d ~ z, data = iv.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19047 -0.32694 -0.00995  0.32877  1.10957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.468338   0.005620   83.33   <2e-16 ***
## z           0.152773   0.002844   53.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 4998 degrees of freedom
## Multiple R-squared:  0.3661, Adjusted R-squared:  0.366 
## F-statistic:  2886 on 1 and 4998 DF,  p-value: < 2.2e-16
```
]

.pull-right[
- Check the 'reduced form'

```
## 
## Call:
## lm(formula = y ~ z, data = iv.dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.937 -2.174 -0.022  2.134  9.212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.96632    0.03978  124.85   <2e-16 ***
## z            0.81474    0.02013   40.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.813 on 4998 degrees of freedom
## Multiple R-squared:  0.2469, Adjusted R-squared:  0.2468 
## F-statistic:  1639 on 1 and 4998 DF,  p-value: < 2.2e-16
```
]

---
# Do we need IV?

- Let's run an "augmented regression" to see if our OLS results are sufficiently different from IV

```r
d.iv <- lm(d ~ z, data=iv.dat)
d.resid <- residuals(d.iv)
haus.test <- lm(y ~ d + d.resid, data=iv.dat)
summary(haus.test)
```

```
## 
## Call:
## lm(formula = y ~ d + d.resid, data = iv.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5610 -0.6366  0.0037  0.6401  3.6203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46865    0.02535   97.38   <2e-16 ***
## dTRUE        5.33305    0.04545  117.35   <2e-16 ***
## d.resid      1.31015    0.05708   22.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9703 on 4997 degrees of freedom
## Multiple R-squared:  0.9104, Adjusted R-squared:  0.9104 
## F-statistic: 2.538e+04 on 2 and 4997 DF,  p-value: < 2.2e-16
```

- The test for significance of `d.resid` suggests OLS is inconsistent in this case

---
# Testing exclusion

- The exclusion restriction says that your instrument does not directly affect your outcome
- Potential testing ideas:
  - "zero-first-stage" test (a subsample on which you know the instrument does not affect the endogenous variable)
  - augmented regression of the reduced-form effect with a subset of instruments (overidentified models only)

---
# Testing exogeneity

- Only available in over-identified models
- Sargan or Hansen's J test (null hypothesis is that the instruments are *uncorrelated* with the residuals, i.e., that the instruments are exogenous)
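---
# Overidentification in practice

Our simulated example is just-identified (one instrument for one endogenous variable), so no overidentification test is available there. A minimal sketch assuming a second, hypothetical instrument `z2` and a version of `fixest` that reports the Sargan statistic through `fitstat()`:

```r
library(fixest)
# Two instruments for one endogenous variable: over-identified model
iv.over <- feols(y ~ 1 | d ~ z + z2, data = iv.dat)
fitstat(iv.over, "sargan")   # Sargan overidentification test
```

Failing to reject is consistent with instrument exogeneity; rejection suggests that at least one instrument is invalid.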