class: center, middle, inverse, title-slide .title[ # Introduction to Regression Discontinuity ] .subtitle[ ##
] .author[ ### Ian McCarthy | Emory University ] .date[ ### Econ 771, Fall 2022 ] --- class: inverse, center, middle <!-- Adjust some CSS code for font size and maintain R code font size --> <style type="text/css"> .remark-slide-content { font-size: 30px; padding: 1em 2em 1em 2em; } .remark-code { font-size: 15px; } .remark-inline-code { font-size: 20px; } </style> <!-- Set R options for how code chunks are displayed and load packages --> # The Basics <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # The Idea Key intuition from RD:<br> -- <br> Observations are <b>identical</b> just above/below threshold --- # The Idea Highly relevant in "rule-based" world... - School eligibility based on age cutoffs - Program participation based on discrete income thresholds - Performance scores rounded to nearest integer --- # Required elements 1. Score 2. Cutoff 3. Treatment --- # Types of RD 1. Sharp regression discontinuity - those above the threshold guaranteed to participate<br> -- <br> 2. Fuzzy regression discontinuity - those above the threshold are eligible but may not participate <!-- New Section --> --- class: inverse, center, middle # Sharp RD <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- class: clear `$$D_{i} = 1(x_{i}>c) = \begin{cases} 1 & \text{if} & x_{i}>c \\ 0 & \text{if} & x_{i}<c \end{cases}$$` <br> - `\(x\)` is "forcing variable" - `\(c\)` is the threshold value or cutoff point --- # Sharp RD Scatterplot <img src="04-1_files/figure-html/rd-plot1-1.png" style="display: block; margin: auto;" /> --- # Sharp RD Linear Predictions <img src="04-1_files/figure-html/rd-plot2-1.png" style="display: block; margin: auto;" /> --- # Sharp RD Linear Predictions <img src="04-1_files/figure-html/rd-plot3-1.png" style="display: block; margin: auto;" /> --- # Different averages - Mean difference around threshold of 0.2, 3.97 - 2.25 = 1.72 - Mean overall difference, 3.74 - 1.49 = 2.25 --- # More generally - Running variable may affect outcome directly - Focusing on area around cutoff does two things:<br> -- <br> 1. Controls for running variable 2. "Controls" for unobserved things correlated with running variable and outcome --- # Animations! .center[ ![](pics/rd_animate.gif) ] --- # Identification - Need `\(E[Y_{i}|x_{i}]\)` to be sufficiently smooth in order to estimate effect of `\(D_{i}\)` on `\(Y_{i}\)` exactly at `\(x_{i}=c\)`. - Assumption: `\(E[Y_{i}(0)|x_{i}]\)` and `\(E[Y_{i}(1)|x_{i}]\)` are continuous in `\(x\)` `$$\delta^{CATE} = E[Y_{i}(1) - Y_{i}(0) | x_{i}=c] = \lim_{x \leftarrow c} E[Y_{i} | x_{i}] - \lim_{x \rightarrow c} E[Y_{i} | x_{i}]$$` --- # Estimation Goal is to estimate `\(E[Y_{1}|X=c] - E[Y_{0}|X=c]\)` 1. Trim to reasonable window around threshold ("bandwidth"), `\(X \in [c-h, c+h]\)` 2. Transform running variable, `\(\tilde{X}=X-c\)` 3. Estimate regressions... - Linear, same slope: `\(y = \alpha + \delta D + \beta \tilde{X} + \varepsilon\)` - Linear, different slope: `\(y = \alpha + \delta D + \beta \tilde{X} + \gamma D\tilde{X} + \varepsilon\)` - Nonlinear: add polynomials in `\(\tilde{X}\)` and interactions `\(D \tilde{X}\)` - Nonparametric kernel estimation --- # Kernels Some RD estimates talk about "kernel weighting" to assign more weight to observations closer to the threshold and less weight to observations further from the threshold. --- # Kernels `$$\hat{\mu}_{+}(x) = \frac{\sum_{i: X_{i}<c} Y_{i} \times K \left(\frac{X_{i} -x}{h} \right)}{\sum_{i: X_{i}<c} K \left(\frac{X_{i} -x}{h} \right)},$$` and `$$\hat{\mu}_{-}(x) = \frac{\sum_{i: X_{i}\geq c} Y_{i} \times K \left(\frac{X_{i} -x}{h} \right)}{\sum_{i: X_{i}\geq c} K \left(\frac{X_{i} -x}{h} \right)},$$` where `\(K(u)\)` is a kernel that assigns weight to observations based on the distance from `\(u\)`. A rectagular kernel is such that `\(K(u)=1/2\)` for `\(u \in (-1,1)\)` and 0 elsewhere. And `\(h\)` is our bandwidth. --- # Kernels and regression - Local linear regression (regression within the pre-specified bandwidth) is a kernel weighted regression with a uniform (or rectangular) kernel. - Could use more complicated kernels for a fully nonparametric approach, but these don't work well around the RD cutoff values. <!-- New Section --> --- class: inverse, center, middle # Regression Discontinuity in Practice <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # RDs "in the wild" Most RD estimates follow a similar set of steps: 1. Show clear graphical evidence of a change around the discontinuity (bin scatter) 2. Balance above/below threshold (use baseline covariates as outcomes) 3. Manipulation tests 4. RD estimates 5. Sensitivity and robustness: - Bandwidths - Order of polynomial - Inclusion of covariates --- class: center, middle # 1. Graphical evidence --- class: clear Before presenting RD estimates, **any** good RD approach first highlights the discontinuity with a simple graph. We can do so by plotting the average outcomes within bins of the forcing variable (i.e., binned averages), `$$\bar{Y}_{k} = \frac{1}{N_{k}}\sum_{i=1}^{N} Y_{i} \times 1(b_{k} < X_{i} \leq b_{k+1}).$$`<br> -- The binned averages helps to remove noise in the graph and can provide a cleaner look at the data. Just make sure that no bin includes observations above and below the cutoff! --- # Binned average calculation ```r library(rdrobust) rd.result <- rdplot(rd.dat$Y, rd.dat$X, c=1, title="RD Plot with Binned Average", x.label="Running Variable", y.label="Outcome") ``` ```r bin.avg <- as_tibble(rd.result$vars_bins) plot.bin <- bin.avg %>% ggplot(aes(x=rdplot_mean_x,y=rdplot_mean_y)) + geom_point() + theme_bw() + geom_vline(aes(xintercept=1),linetype='dashed') + scale_x_continuous( breaks = c(.5, 1.5), label = c("Untreated", "Treated") ) + xlab("Running Variable") + ylab("Outcome") ``` --- # Binned average plot <img src="04-1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # With and without binning .pull-left[ <img src="04-1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="04-1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- # Selecting "bin" width 1. Dummy variables: Create dummies for each bin, regress the outcome on the set of all dummies and form r-square `\(R^{2}_{r}\)`, repeat with double the number of bins and find r-square value `\(R^{2}_{u}\)`, form F-stat, `\(\frac{R^{2}_{u}-R^{2}_{r}}{1-R^{2}_{u}}\times \frac{n-K-1}{K}\)`. 2. Interaction terms: Include interactions between dummies and the running variable, joint F-test for the interaction terms If F-test suggests significance, then we have too few bins and need to narrow the bin width. --- class: center, middle # 2. Balance --- # Assessing balance - If RD is an appropriate design, passing the cutoff should **only** affect treatment and outcome of interest - How do we test for this? - Covariate balance - Placebo tests of other outcomes (e.g., t-1 outcomes against treatment at time t) --- class: center, middle # 3. Manipulation tests --- # Manipulation of running variable - Individuals should not be able to **precisely** manipulate running variable to enter into treatment - Sometimes discussed as "bunching" - Test for differences in density to left and right of cutoffs (`rddensity` in Stata and R) - Permutation tests proposed in Ganong and Jager (2017) --- # What if bunching exists? - Gerard, Rokkanen, and Rothe (2020) suggest partial identification allowing for bunching - Can also be used as a robustness check - `rdbounds` in Stata and R - Assumption: bunching only moves people in one direction --- class: center, middle # 4. RD Estimation --- # Baseline RD estimates Start with the "default" options - Local linear regression - Optimal bandwidth - Uniform kernel --- # Selecting bandwidth in local linear regression The bandwidth is a "tuning parameter" - High `\(h\)` means high bias but lower variance (use more of the data, closer to OLS) - Low `\(h\)` means low bias but higher variance (use less data, more focused around discontinuity)<br> -- Represent bias-variance tradeoff with the mean-square error, `$$MSE(h) = E[(\hat{\tau}_{h} - \tau_{RD})^2]=\left(E[\hat{\tau}_{h} - \tau_{RD}] \right)^2 + V(\hat{\tau}_{h}).$$` --- # Selecting bandwidth In the RD case, we have two different mean-square error terms: 1. "From above", `\(MSE_{+}(h) = E[(\hat{\mu}_{+}(c,h) - E[Y_{1i}|X_{i}=c])^2]\)` 2. "From below", `\(MSE_{-}(h) = E[(\hat{\mu}_{-}(c,h) - E[Y_{0i}|X_{i}=c])^2]\)`<br> -- Goal is to find `\(h\)` that minimizes these values, but we don't know the true `\(E[Y_{1}|X=c]\)` and `\(E[Y_{0}|X=c]\)`. So we have two approaches: 1. Use **cross-validation** to choose `\(h\)` 2. Explicitly solve for optimal bandwidth --- # Cross-validation Essentially a series of "leave-one-out" estimates: 1. Pick an `\(h\)` 2. Run regression, leaving out observation `\(i\)`. If `\(i\)` is to the left of the threshold, we estimate regression for observations within `\(X_{i}-h\)`, and conversely `\(X_{i}+h\)` if `\(i\)` is to the right of the threshold. 3. Predicted `\(\hat{Y}_{i}\)` at `\(X_{i}\)` (out of sample prediction for the left out observation) 4. Do this for all `\(i\)`, and form `\(CV(h)=\frac{1}{N}\sum (Y_{i} - \hat{Y}_{i})^2\)` <br> -- Select `\(h\)` with lowest `\(CV(h)\)` value. --- # Back to simulated data <img src="04-1_files/figure-html/rd-real1-1.png" style="display: block; margin: auto;" /> --- # Back to simulated data ```r ols <- lm(Y~X+W, data=rd.dat) rd.dat3 <- rd.dat %>% mutate(x_dev = X-1) %>% filter( (X>0.8 & X <1.2) ) rd <- lm(Y~x_dev + W, data=rd.dat3) ``` - True effect: 1.5 - Standard linear regression with same slopes: 1.68 - RD (linear with same slopes): 1.58 --- # RD with built-in commands .pull-left[ ``` ## Sharp RD estimates using local polynomial regression. ## ## Number of Obs. 1000 ## BW type mserd ## Kernel Triangular ## VCE method NN ## ## Number of Obs. 482 518 ## Eff. Number of Obs. 146 187 ## Order est. (p) 1 1 ## Order bias (q) 2 2 ## BW est. (h) 0.330 0.330 ## BW bias (b) 0.476 0.476 ## rho (h/b) 0.693 0.693 ## Unique Obs. 482 518 ## ## ============================================================================= ## Method Coef. Std. Err. z P>|z| [ 95% C.I. ] ## ============================================================================= ## Conventional 1.593 0.108 14.732 0.000 [1.381 , 1.805] ## Robust - - 12.530 0.000 [1.351 , 1.852] ## ============================================================================= ``` ] .pull-right[ Cattaneo et al. (2020) argue: - Report conventional point estimate - Report robust confidence interval ] --- class: center, middle # 5. Robustness and sensitivity --- # Other options - Different bandwidths - Different kernels or polynomials - Role of covariates in RD estimates --- # Pitfalls of polynomials - Assign too much weight to points away from the cutoff - Results **highly** sensitive to degree of polynomial - Narrow confidence intervals (over-rejection of the null) For more discussion, see this [World Bank Blog post](https://blogs.worldbank.org/impactevaluations/curves-all-wrong-places-gelman-and-imbens-why-not-use-higher-order-polynomials-rd) --- class: inverse, center, middle # Fuzzy RD <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # The Idea "Fuzzy" just means that assignment isn't guaranteed based on the running variable. For example, maybe students are much more likely to get a scholarship past some threshold SAT score, but it remains possible for students below the threshold to still get the scholarship. - Discontinuity reflects a jump in the probability of treatment - Other RD assumptions still required (namely, can't manipulate running variable around the threshold) --- # Fuzzy RD is IV In practice, fuzzy RD is employed as an instrumental variables estimator - Difference in outcomes among those above and below the discontinuity divided by the difference in treatment probabilities for those above and below the discontinuity,<br> `\(E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | x_{i}\geq c] - E[Y_{i} | x_{i}< c]}{E[D_{i} | x_{i}\geq c] - E[D_{i} | x_{i}<c]}\)` - Indicator for `\(x_{i}\geq c\)` is an instrument for treatment status, `\(D_{i}\)`. - Implemented with `rdrobust` and `fuzzy=t` option