class: center, middle, inverse, title-slide

.title[
# Introduction to Instrumental Variables
]
.subtitle[
##
]
.author[
### Ian McCarthy | Emory University
]
.date[
### Econ 771, Fall 2022
]

---
class: inverse, center, middle

<!-- Adjust some CSS code for font size and maintain R code font size -->
<style type="text/css">
.remark-slide-content {
  font-size: 30px;
  padding: 1em 2em 1em 2em;
}
.remark-code {
  font-size: 15px;
}
.remark-inline-code {
  font-size: 20px;
}
</style>

<!-- Set R options for how code chunks are displayed and load packages -->

# Assessing Selection on Unobservables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html>

---
class: clear

- Say we estimate a regression like
`$$y_{i} = \delta D_{i} + \beta_{1} x_{1i} + \varepsilon_{i}$$`
- But we are concerned that the "true" specification is
`$$y_{i} = \delta D_{i} + \beta_{1} x_{1i} + \beta_{2} x_{2i} + \varepsilon_{i}$$`
- **Idea:** Extending the work of Altonji and others, Oster (2019) aims to decompose the outcome into a treatment effect ($\delta$), observed controls ($x_{1i}$), unobserved controls ($x_{2i}$), and an iid error

---
# Oster (2019), JBES

- **Key assumption:** Selection on observables is informative about selection on unobservables

1. What is the maximum `\(R^2\)` value we could obtain if we observed `\(x_{2}\)`? Call this `\(R_{\text{max}}^{2}\)` (naturally bounded above by 1, but likely smaller)

2. What is the degree of selection on observed variables relative to unobserved variables? Denote the proportional relationship as `\(\rho\)` such that:
`$$\rho \times \frac{Cov(x_{1},D)}{Var(x_{1})} = \frac{Cov(x_{2},D)}{Var(x_{2})}.$$`

---
# Oster (2019), JBES

- Under an "equal relative contributions" assumption, we can write:
`$$\delta^{*} \approx \hat{\delta}_{D,x_{1}} - \rho \times \left[\hat{\delta}_{D} - \hat{\delta}_{D,x_{1}}\right] \times \frac{R_{\text{max}}^{2} - R_{D,x_{1}}^{2}}{R_{D,x_{1}}^{2} - R_{x_{1}}^{2}} \xrightarrow{p} \delta.$$`
- Consider a range of `\(R^{2}_{\text{max}}\)` and `\(\rho\)` to bound the estimated treatment effect,
`$$\left[ \hat{\delta}_{D,x_{1}}, \delta^{*}(\bar{R}^{2}_{max}, \rho) \right]$$`
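---
# Oster (2019) in practice

A minimal sketch of the bound from the previous slide, assuming a hypothetical data frame `df` with outcome `y`, treatment `d`, and observed control `x1`; the values of `rho` and `r2.max` below are illustrative choices (the 1.3 multiplier is a commonly used heuristic), not estimates.

```r
short <- lm(y ~ d, data = df)        # regression with treatment only
ctrl <- lm(y ~ d + x1, data = df)    # regression with observed control

delta.d <- coef(short)[["d"]]                       # delta-hat_D
delta.dx1 <- coef(ctrl)[["d"]]                      # delta-hat_{D,x1}
r2.dx1 <- summary(ctrl)$r.squared                   # R^2_{D,x1}
r2.x1 <- summary(lm(y ~ x1, data = df))$r.squared   # R^2_{x1}

rho <- 1                        # equal selection on observables/unobservables
r2.max <- min(1.3*r2.dx1, 1)    # chosen upper bound on R^2

delta.star <- delta.dx1 - rho*(delta.d - delta.dx1)*
  (r2.max - r2.dx1)/(r2.dx1 - r2.x1)
c(delta.dx1, delta.star)        # bounding set for the treatment effect
```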
---
# Augmented regression (somewhat out of place here)

- Oster (2019) and similar papers can say something about how bad selection on unobservables would need to be
- But what kind of "improvement" do we really get in practice?

---
class: clear

- Original test from Hausman (1978) not specific to endogeneity, just a general misspecification test
- Compare estimates from one estimator (efficient under the null) to another estimator that is consistent but inefficient under the null
- In the IV context, also known as the Durbin-Wu-Hausman test, due to the series of papers pre-dating Hausman (1978), including Durbin (1954) and Wu (1973)

---
class: clear

- Easily implemented as an "artificial" or "augmented" regression
- We want to estimate `\(y=\beta_{1}x_{1} + \beta_{2}x_{2} + \varepsilon\)`, with exogenous variables `\(x_{1}\)`, endogenous variables `\(x_{2}\)`, and instruments `\(z\)`

1. Regress each of the variables in `\(x_{2}\)` on `\(x_{1}\)` and `\(z\)`, `\(x_{2} = \lambda_{x} x_{1} + \lambda_{z} z + v\)`, and form the residuals `\(\hat{v}\)`
2. Include `\(\hat{v}\)` in the standard OLS regression of `\(y\)` on `\(x_{1}\)`, `\(x_{2}\)`, and `\(\hat{v}\)`.
3. Test `\(H_{0}: \beta_{\hat{v}} = 0\)`. Rejection implies OLS is inconsistent.

--

<br>
Intuition: The only way for `\(x_{2}\)` to be correlated with `\(\varepsilon\)` is through `\(v\)`, **assuming `\(z\)` is a "good" instrument**

---
class: clear

- Do we have an endogeneity problem?
  - Effects easily overcome by small selection on unobservables?
  - Clear reverse causality problem?
- What can we do about it?
  - Matching, weighting, regression? Only for selection on observables
  - DD, RD, differences in discontinuities? Specific designs and settings
  - Instrumental variables?

---
class: inverse, center, middle

# Instrumental Variables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html>

---
# What are instrumental variables?

Instrumental variables (IV) is a way to identify causal effects using variation in treatment participation that is due to an *exogenous* variable that is only related to the outcome through treatment.

<img src="03-1_files/figure-html/tikz-timeline-1.png" style="display: block; margin: auto;" />

---
# Simple example

- `\(y = \beta x + \varepsilon(x)\)`,<br>
where `\(\varepsilon(x)\)` reflects the dependence between our observed variable and the error term.<br>
- Simple OLS will yield<br>
`\(\frac{dy}{dx} = \beta + \frac{d\varepsilon}{dx} \neq \beta\)`

---
# What does IV do?

- The regression we want to do: <br>
`\(y_{i} = \alpha + \delta D_{i} + \gamma A_{i} + \epsilon_{i}\)`,<br>
where `\(D_{i}\)` is treatment (think of schooling for now) and `\(A_{i}\)` is something like ability.
- `\(A_{i}\)` is unobserved, so instead we run: <br>
`\(y_{i} = \alpha + \beta D_{i} + \epsilon_{i}\)`
- From this "short" regression, we don't actually estimate `\(\delta\)`. Instead, we get an estimate of<br>
`\(\beta = \delta + \lambda_{ds}\gamma \neq \delta\)`,<br>
where `\(\lambda_{ds}\)` is the coefficient from a regression of `\(A_{i}\)` on `\(D_{i}\)`.

---
# Intuition

IV will recover the "long" regression without observing underlying ability<br>

--

<br>
*IF* our IV satisfies all of the necessary assumptions.

---
# More formally

- We want to estimate<br>
`\(E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0]\)`
- With an instrument `\(Z_{i}\)` that satisfies the relevant assumptions, we can estimate this as<br>
`\(E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}\)`
- In words, this is the effect of the instrument on the outcome ("reduced form") divided by the effect of the instrument on treatment ("first stage")

---
# Derivation

Recall the "long" regression: `\(Y=\alpha + \delta S + \gamma A + \epsilon\)`.

`$$\begin{align}
COV(Y,Z) & = E[YZ] - E[Y] E[Z] \\
 & = E[(\alpha + \delta S + \gamma A + \epsilon)\times Z] - E[\alpha + \delta S + \gamma A + \epsilon]E[Z] \\
 & = \alpha E[Z] + \delta E[SZ] + \gamma E[AZ] + E[\epsilon Z] \\
 & \hspace{.2in} - \alpha E[Z] - \delta E[S]E[Z] - \gamma E[A] E[Z] - E[\epsilon]E[Z] \\
 & = \delta (E[SZ] - E[S] E[Z]) + \gamma (E[AZ] - E[A] E[Z]) \\
 & \hspace{.2in} + E[\epsilon Z] - E[\epsilon] E[Z] \\
 & = \delta C(S,Z) + \gamma C(A,Z) + C(\epsilon, Z)
\end{align}$$`

---
# Derivation

Working from `\(COV(Y,Z) = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon,Z)\)`, we find

`$$\delta = \frac{COV(Y,Z)}{COV(S,Z)}$$`

if `\(COV(A,Z)=COV(\epsilon, Z)=0\)`

---
# IVs in practice

Easy to think of in terms of a randomized controlled trial...

--

<br>

Measure    | Offered Seat | Not Offered Seat | Difference
---------- | ------------ | ---------------- | ----------
Score      | -0.003       | -0.358           | 0.355
% Enrolled | 0.787        | 0.046            | 0.741
Effect     |              |                  | 0.48

<br>

.footnote[
Angrist *et al.*, 2012. "Who Benefits from KIPP?" *Journal of Policy Analysis and Management*.
]
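---
# IVs in practice: the Wald ratio

The "Effect" row on the previous slide is just the reduced form divided by the first stage. A quick check with the numbers from the table, which matches the 0.48 up to rounding:

```r
# Wald estimate from the KIPP lottery: reduced form / first stage
reduced.form <- -0.003 - (-0.358)   # offer effect on test scores
first.stage <- 0.787 - 0.046        # offer effect on enrollment
reduced.form/first.stage            # approximately 0.48
```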
---
# What is IV *really* doing

Think of IV as two steps:

1. Isolate variation due to the instrument only (not due to endogenous stuff)
2. Estimate the effect on the outcome using only this source of variation

---
# In regression terms

Interested in estimating `\(\delta\)` from `\(y_{i} = \alpha + \beta x_{i} + \delta D_{i} + \varepsilon_{i}\)`, but `\(D_{i}\)` is endogenous (no pure "selection on observables").

--

<br>
<b>Step 1:</b> With instrument `\(Z_{i}\)`, we can regress `\(D_{i}\)` on `\(Z_{i}\)` and `\(x_{i}\)`,<br>
`\(D_{i} = \lambda + \theta Z_{i} + \kappa x_{i} + \nu\)`,<br>
and form the prediction `\(\hat{D}_{i}\)`.

--

<br>
<b>Step 2:</b> Regress `\(y_{i}\)` on `\(x_{i}\)` and `\(\hat{D}_{i}\)`,<br>
`\(y_{i} = \alpha + \beta x_{i} + \delta \hat{D}_{i} + \xi_{i}\)`

---
# Derivation

Recall `\(\hat{\theta}=\frac{C(Z,S)}{V(Z)}\)`, or `\(\hat{\theta}V(Z) = C(S,Z)\)`. Then:

`$$\begin{align}
\hat{\delta} & = \frac{COV(Y,Z)}{COV(S,Z)} \\
 & = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}C(S,Z)} = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}^{2}V(Z)} \\
 & = \frac{C(\hat{\theta}Z,Y)}{V(\hat{\theta}Z)} = \frac{C(\hat{S},Y)}{V(\hat{S})}
\end{align}$$`

---
# Animation for IV

.center[
![](pics/iv_animate.gif)
]

---
# Simulated data

.pull-left[

```r
n <- 5000
b.true <- 5.25
iv.dat <- tibble(
  z = rnorm(n,0,2),
  eps = rnorm(n,0,1),
  d = (z + 1.5*eps + rnorm(n,0,1) > 0.25),
  y = 2.5 + b.true*d + eps + rnorm(n,0,0.5)
)
```
]

.pull-right[
- endogenous `eps`: affects treatment and outcome
- `z` is an instrument: affects treatment but has no direct effect on the outcome
]

---
# Results with simulated data

Recall that the *true* treatment effect is 5.25

.pull-left[

```
## 
## Call:
## lm(formula = y ~ d, data = iv.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5361 -0.6813  0.0009  0.6888  3.5645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.07914    0.01980   105.0   <2e-16 ***
## dTRUE        6.16356    0.02891   213.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 4998 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.9009 
## F-statistic: 4.546e+04 on 1 and 4998 DF,  p-value: < 2.2e-16
```
]

.pull-right[

```
## TSLS estimation, Dep. Var.: y, Endo.: d, Instr.: z
## Second stage: Dep. Var.: y
## Observations: 5,000
## Standard-errors: IID 
##             Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)  2.46865   0.028767  85.8167 < 2.2e-16 ***
## fit_dTRUE    5.33305   0.051572 103.4095 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 1.10089   Adj. R2: 0.884565
## F-test (1st stage), dTRUE: stat = 2,886.4, p < 2.2e-16, on 1 and 4,998 DoF.
##                Wu-Hausman: stat =   526.8, p < 2.2e-16, on 1 and 4,997 DoF.
```
]

---
# Two-stage equivalence

```r
step1 <- lm(d ~ z, data=iv.dat)
d.hat <- predict(step1)
step2 <- lm(y ~ d.hat, data=iv.dat)
summary(step2)
```

```
## 
## Call:
## lm(formula = y ~ d.hat, data = iv.dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.937 -2.174 -0.022  2.134  9.212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46865    0.07348   33.59   <2e-16 ***
## d.hat        5.33305    0.13174   40.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.813 on 4998 degrees of freedom
## Multiple R-squared:  0.2469, Adjusted R-squared:  0.2468 
## F-statistic:  1639 on 1 and 4998 DF,  p-value: < 2.2e-16
```
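---
# Manual 2SLS vs. packaged IV

The manual two-step procedure reproduces the TSLS point estimate (5.33305), but its second-stage standard errors are not correct because they treat `d.hat` as data rather than as an estimate. In practice, use a packaged IV estimator. The call behind the TSLS output above is not shown on that slide; a sketch using `fixest` (which the output format suggests) would be:

```r
library(fixest)
# fixest IV syntax: outcome ~ exogenous vars | endogenous ~ instruments
iv.est <- feols(y ~ 1 | d ~ z, data = iv.dat)
summary(iv.est)
```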
---
class: inverse, center, middle

# Assumptions of Instrumental Variables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html>

---
# Key IV assumptions

1. *Exclusion:* Instrument is uncorrelated with the error term<br>
2. *Validity:* Instrument is correlated with the endogenous variable<br>
3. *Monotonicity:* Treatment more (less) likely for those with higher (lower) values of the instrument<br>

--

<br>
Assumptions 1 and 2 are sometimes grouped into an *only through* condition.

---
# Checking instrument

.pull-left[
- Check the 'first stage'

```
## 
## Call:
## lm(formula = d ~ z, data = iv.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19047 -0.32694 -0.00995  0.32877  1.10957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.468338   0.005620   83.33   <2e-16 ***
## z           0.152773   0.002844   53.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 4998 degrees of freedom
## Multiple R-squared:  0.3661, Adjusted R-squared:  0.366 
## F-statistic:  2886 on 1 and 4998 DF,  p-value: < 2.2e-16
```
]

.pull-right[
- Check the 'reduced form'

```
## 
## Call:
## lm(formula = y ~ z, data = iv.dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.937 -2.174 -0.022  2.134  9.212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.96632    0.03978  124.85   <2e-16 ***
## z            0.81474    0.02013   40.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.813 on 4998 degrees of freedom
## Multiple R-squared:  0.2469, Adjusted R-squared:  0.2468 
## F-statistic:  1639 on 1 and 4998 DF,  p-value: < 2.2e-16
```
]

---
# Do we need IV?

- Let's run an "augmented regression" to see if our OLS results are sufficiently different from IV

```r
d.iv <- lm(d ~ z, data=iv.dat)
d.resid <- residuals(d.iv)
haus.test <- lm(y ~ d + d.resid, data=iv.dat)
summary(haus.test)
```

```
## 
## Call:
## lm(formula = y ~ d + d.resid, data = iv.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5610 -0.6366  0.0037  0.6401  3.6203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46865    0.02535   97.38   <2e-16 ***
## dTRUE        5.33305    0.04545  117.35   <2e-16 ***
## d.resid      1.31015    0.05708   22.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9703 on 4997 degrees of freedom
## Multiple R-squared:  0.9104, Adjusted R-squared:  0.9104 
## F-statistic: 2.538e+04 on 2 and 4997 DF,  p-value: < 2.2e-16
```

- The test for significance of `d.resid` suggests OLS is inconsistent in this case

---
# Testing exclusion

- The exclusion restriction says that your instrument does not directly affect your outcome
- Potential testing ideas:
  - "zero-first-stage" test (a subsample on which you know the instrument does not affect the endogenous variable)
  - augmented regression of the reduced-form effect with a subset of instruments (overidentified models only)

---
# Testing exogeneity

- Only available in over-identified models
- Sargan or Hansen's J test (null hypothesis is that the instruments are *uncorrelated* with the residuals, i.e., that the instruments are exogenous)
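---
# Overidentification in practice

Our simulated example is just-identified (one instrument for one endogenous variable), so no overidentification test is available there. A minimal sketch assuming a second, hypothetical instrument `z2` and a version of `fixest` that reports the Sargan statistic through `fitstat()`:

```r
library(fixest)
# Two instruments for one endogenous variable: over-identified model
iv.over <- feols(y ~ 1 | d ~ z + z2, data = iv.dat)
fitstat(iv.over, "sargan")   # Sargan overidentification test
```

Failing to reject is consistent with instrument exogeneity; rejection suggests that at least one instrument is invalid.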