------------------------------------------------------------------------------------------------------------------------
help for rd
------------------------------------------------------------------------------------------------------------------------
Regression discontinuity (RD) estimator
Syntax
rd [varlist] [if] [in] [weight] [, options]
where varlist has the form outcomevar [treatmentvar] assignmentvar
+---------+
----+ Weights +---------------------------------------------------------------------------------------------------
aweights, fweights, and pweights are allowed; see help weights. Under Stata versions 9.2 or before (using locpoly
to construct local regression estimates) aweights and pweights will be converted to fweights automatically and the
data expanded. If this would exceed system memory limits, error r(901) will be issued; in this case, the user is
advised to round weights. In any case, the validity of bootstrapped standard errors will depend on the expanded
data correctly representing sampling variability, which may require rounding or replacing weight variables. Under
Stata versions 10 or later (using lpoly to construct local regression estimates), all weights will be treated as
aweights.
bs [, options]: rd varlist [if] [in] [weight] [, options]
+----------------------------+
----+ Table of Further Contents +--------------------------------------------------------------------------------
General description of estimator
Examples
Detailed syntax
Description of options
Remarks and saved results
References
Acknowledgements
Citation of rd
Author information
+-------------+
----+ Description +-----------------------------------------------------------------------------------------------
rd implements a set of regression-discontinuity estimation methods that are thought to have very good internal validity,
for estimating the causal effect of some explanatory variable (called the treatment variable) for a particular
subpopulation, under some often plausible assumptions. In this sense, it is much like an experimental design, except
that levels of the treatment variable are not assigned randomly by the researcher. Instead, there is a jump in the
conditional mean of the treatment variable at a known cutoff in another variable, called the assignment variable, which
is perfectly observed, and this allows us to estimate the effect of treatment as if it were randomly assigned in the
neighborhood of the known cutoff.
rd is an alternative to various regression techniques that purport to allow causal inference (e.g. panel methods such as
xtreg), instrumental variables (IV) and other IV-type methods (see the ivreg2 help file and references therein), and
matching estimators (see the psmatch2 and nnmatch help files and references therein). The rd approach is in fact an IV
model with one excluded instrument (an indicator for the assignment variable being above the cutoff, which is exogenous
and excluded from the regression) and one endogenous regressor (the treatment variable).
rd estimates local linear or kernel regression models on both sides of the cutoff, using a triangle kernel by default
(see the kernel option). Estimates are sensitive to the choice of bandwidth, so by default several estimates are
constructed using different bandwidths. In practice, rd uses kernel-weighted suest (or ivreg if suest fails) to estimate
the local linear regressions and reports analytic standard errors based on those regressions.
Further discussion of rd appears in Nichols (2007).
+----------+
----+ Examples +--------------------------------------------------------------------------------------------------
In the simplest case, assignment to treatment depends on a variable Z being above a cutoff Z0. Frequently, Z is defined
so that Z0=0. In this case, treatment is 1 for Z>=0 and 0 for Z<0, and we estimate local linear regressions on both
sides of the cutoff to obtain estimates of the outcome at Z=0. The difference between the two estimates (for the
samples where Z>=0 and where Z<0) is the estimated effect of treatment.
For example, having a Democratic representative in the US Congress may be considered a treatment applied to a
Congressional district, and the assignment variable Z is the vote share garnered by the Democratic candidate. At Z=50%,
the probability of treatment=1 jumps from zero to one. Suppose we are interested in the effect a Democratic
representative has on the federal spending within a Congressional district. rd estimates local linear regressions on
both sides of the cutoff like so:
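* Install rd from SSC and fetch its ancillary files, which supply the votex data used below
ssc inst rd, replace
net get rd
use votex
* Sharp RD of lne on the assignment variable d at the default bandwidth only, with graphs
rd lne d, gr mbw(100)
* Same, relabeling the x axis through lineopt()
rd lne d, gr mbw(100) line(`"xla(-.2 "Repub" 0 .3 "Democ", noticks)"')
* Add the check for a discontinuity in the density of d
rd lne d, gr ddens
* Plot estimates against bandwidth multiples from 25% to 300%, marking the default bandwidth
rd lne d, mbw(25(25)300) bdep ox
* Estimate jumps in the control variables pop through vet
rd lne d, x(pop-vet)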
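To make the idea of local linear regressions on both sides of the cutoff concrete, the following is a minimal
hand-rolled sketch of a sharp RD estimate with triangle kernel weights. It is not rd's internal code: the bandwidth of
0.1 and the names kw_sharp, mu_above, and mu_below are assumptions made for illustration, and the result need not
match rd's output exactly.

```stata
* Illustrative only: sharp RD at the cutoff d = 0 with a hypothetical bandwidth
use votex, clear
local h = 0.1
* Triangle kernel weights: 1 at the cutoff, falling linearly to 0 at distance h
generate double kw_sharp = max(0, 1 - abs(d)/`h')
* Local linear fit to the right of the cutoff; the constant is the fitted outcome at d = 0
regress lne d if d >= 0 & kw_sharp > 0 [aweight = kw_sharp]
scalar mu_above = _b[_cons]
* Local linear fit to the left of the cutoff
regress lne d if d < 0 & kw_sharp > 0 [aweight = kw_sharp]
scalar mu_below = _b[_cons]
* The difference between the two intercepts is the estimated jump
display "Estimated jump in lne at d = 0: " mu_above - mu_below
```

Because the assignment variable is centered at the cutoff, each regression's constant term is the predicted outcome
at the cutoff, which is why only the two intercepts are compared.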
In a fuzzy RD design, the conditional mean of treatment jumps at the cutoff, and that jump forms the denominator of a
Local Wald Estimator. The numerator is the jump in the outcome, and both are reported along with their ratio. The sharp
RD design is a special case of the fuzzy RD design, since the denominator in the sharp case is just one.
g byte ranwin=cond(uniform()<.1,1-win,win)
rd lne ranwin d, mbw(100)
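Since the local Wald estimator is the ratio of two jumps, it can also be sketched as a kernel-weighted IV regression
in which an indicator for being above the cutoff instruments for treatment. The block below is a hedged illustration,
not rd's internal code: it assumes votex is in memory, that ranwin was generated as above, and that Stata 10 or later
is available for ivregress. The bandwidth of 0.1 and the names kw_fuzzy, z_above, and dz_above are hypothetical.

```stata
* Illustrative only: local Wald estimate at d = 0 with a hypothetical bandwidth
local h = 0.1
* Triangle kernel weights around the cutoff
generate double kw_fuzzy = max(0, 1 - abs(d)/`h')
* Excluded instrument: indicator for the assignment variable at or above the cutoff
generate byte z_above = (d >= 0) if !missing(d)
* Interaction lets the slope in d differ on the two sides of the cutoff
generate double dz_above = d*z_above
* The coefficient on ranwin is the jump in lne divided by the jump in ranwin
ivregress 2sls lne d dz_above (ranwin = z_above) if kw_fuzzy > 0 [aweight = kw_fuzzy]
```

The model is just identified, so the IV coefficient on ranwin equals the reduced-form jump in the outcome divided by
the first-stage jump in treatment, using the same bandwidth for both.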
The default bandwidth from Imbens and Kalyanaraman (2009) is designed to minimize MSE, or squared bias plus variance, in
a sharp RD design. Note that a smaller bandwidth tends to produce lower bias and higher variance. The optimal bandwidth
will tend to be larger for a fuzzy design due to the additional variance arising from the estimation of the jump in the
conditional mean of treatment. Unfortunately, a larger bandwidth also leads to additional bias, which will be greater
if the curvature of the response function is greater (meaning that a linear regression over a larger range is a poorer
approximation). The increase in squared bias due to dividing by the estimated jump in the conditional mean of treatment
(using observations away from the discontinuity) can easily dominate the increase in variance and make the optimal
bandwidth in a fuzzy design smaller than in the sharp design. No clear guidance is offered; conducting simulations
using plausible data-generating functions for your specific application is highly recommended. The rd option bdep
facilitates visualizing the dependence of the estimate on bandwidth.
rd lne ranwin d, mbw(25(25)300) bdep ox
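As a starting point for the kind of simulation recommended above, the sketch below draws one artificial fuzzy-design
data set with a known treatment effect and plots the rd estimates against bandwidth. Every name and number in it (z, t,
y, the sample size, the treatment probabilities, the curvature, and the true effect of 0.5) is an assumption chosen
for illustration; substitute a generating function that is plausible for your application. Note that clear drops the
data in memory.

```stata
* Illustrative only: one simulated draw from a hypothetical fuzzy RD design
clear
set seed 12345
set obs 2000
* Assignment variable centered at the cutoff
generate double z = uniform() - 0.5
* Probability of treatment jumps from 0.2 to 0.8 at z = 0
generate byte t = uniform() < cond(z >= 0, 0.8, 0.2)
* Outcome with a true treatment effect of 0.5 and a curved response in z
generate double y = 1 + 0.5*t + 2*z + 4*z^2 + 0.5*invnormal(uniform())
* Compare estimates across bandwidth multiples with the known effect
rd y t z, z0(0) mbw(25(25)300) bdep ox
```

Repeating the draw many times and recording the estimate at each bandwidth multiple would show how bias and variance
trade off for that generating function.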
+-----------------------------+
----+ Detailed Syntax and Options +-------------------------------------------------------------------------------
There should be two or three variables specified after the rd command; if two are specified, a sharp RD design is
assumed, where the treatment variable jumps from zero to one at the cutoff. If no variables are specified after the rd
command, the estimates table is displayed.
rd outcomevar [treatmentvar] assignmentvar [if] [in] [weight] [, options]
+-----------------+
----+ Options summary +-------------------------------------------------------------------------------------------
mbw(numlist) specifies a list of multiples for bandwidths, in percentage terms. The default is "100 50 200" (i.e. half
and twice the requested bandwidth) and 100 is always included in the list, regardless of whether it is specified.
z0(real) specifies the cutoff Z0 in assignmentvar Z.
strineq specifies that mean treatment differs at Z0 from all Z>Z0 (e.g. treatment is 1 for Z>0 and 0 for Z<=0); the
default assumption is that mean treatment differs at Z0 from all Z<Z0 (e.g. treatment is 1 for Z>=0 and 0 for Z<0).
x(varlist) requests estimates of jumps in control variables varlist.
ddens requests a computation of a discontinuity in the density of Z. This is computed in a relatively ad hoc way, and
should be redone using McCrary's test described at
http://www.econ.berkeley.edu/~jmccrary/DCdensity/.
s(stubname) requests that estimates be saved as new variables beginning with stubname.
graph requests that local linear regression graphs for each bandwidth be produced.
noscatter suppresses the scatterplot on those graphs.
scopt(string) supplies an option list to the scatter plot.
lineopt(string) supplies an option list to the overlaid line plots.
n(real) specifies the number of points at which to calculate local linear regressions. The default is to calculate the
regressions at 50 points above the cutoff, with equal steps in the grid, and to use equal steps below the cutoff,
with the number of points determined by the step size.
bwidth(real) allows specification of a bandwidth for local linear regressions. The default is to use the estimated
optimal bandwidth for a "sharp" design as given by Imbens and Kalyanaraman (2009). The optimal bandwidth minimizes
MSE, or squared bias plus variance, where a smaller bandwidth tends to produce lower bias and higher variance. Note
that the optimal bandwidth will often tend to be larger for a fuzzy design, due to the additional variance that
arises from the estimation of the jump in the conditional mean of treatment, although the additional bias from a
wider bandwidth can outweigh this (see Examples).
bdep requests a graph of estimates versus bandwidths.
oxline adds a vertical line at the default bandwidth.
kernel(rectangle) requests the use of a rectangle (uniform) kernel. The default is a triangle (edge) kernel.
covar(varlist) adds covariates to Local Wald Estimation, which is generally a Very Bad Idea. It is possible that
covariates could reduce residual variance and improve efficiency, but estimation error in their coefficients could
also reduce efficiency, and any violations of the assumptions that such covariates are exogenous and have a linear
impact on mean treatment and outcomes could greatly increase bias.
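A few of the options above can be combined as in the following sketch, which assumes the votex example data are in
memory. The numeric values and the stub fit_ are hypothetical and only illustrate the syntax.

```stata
* Illustrative only: explicit cutoff, user-chosen bandwidth, rectangle kernel, finer grid
rd lne d, z0(0) bwidth(0.1) mbw(50 200) kernel(rectangle) n(100)
* Illustrative only: save estimates in new variables whose names begin with fit_
rd lne d, mbw(100) s(fit_)
```

In the first command, 100 is added to the list of multiples automatically, so estimates are reported at 50%, 100%,
and 200% of the requested bandwidth.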
+---------------------------+
----+ Remarks and saved results +---------------------------------------------------------------------------------
To facilitate bootstrapping, rd saves the following results in e():
Scalars
e(N) Number of observations used in estimation
e(w) Bandwidth in base model; other bandwidths are reported in e.g. e(w50) for the 50% multiple.
Macros
e(cmd) rd
e(rdversion) Version number of rd
e(depvar) Name of dependent variable
Matrices
e(b) Coefficient vector of estimated jumps in variables at different percentage bandwidth multiples
Functions
e(sample) Marks estimation sample
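The saved results can be inspected directly after estimation and are what the bootstrap prefix resamples. The sketch
below assumes the votex example data are in memory; the number of replications and the seed are arbitrary choices.

```stata
* Illustrative only: estimate at the default bandwidth, then inspect saved results
rd lne d, mbw(100)
ereturn list
display "N = " e(N) "   bandwidth = " e(w)
matrix list e(b)
* Bootstrap the estimated jump; by default the coefficients in e(b) are collected
bootstrap, reps(200) seed(12345): rd lne d, mbw(100)
```

Unless bwidth() is specified, the bandwidth is chosen again within each bootstrap replicate.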
References
Many references appear in
Nichols, Austin. 2007. "Causal Inference with Observational Data." Stata Journal 7(4): 507-541.
but the interested reader is directed also to
Imbens, Guido and Thomas Lemieux. 2007. "Regression Discontinuity Designs: A Guide to Practice." NBER Working
Paper 13039.
McCrary, Justin. 2007. "Manipulation of the Running Variable in the Regression Discontinuity Design: A Density
Test." NBER Technical Working Paper 334.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs
for Generalized Causal Inference. Boston: Houghton Mifflin.
Fuji, Daisuke, Guido Imbens, and Karthik Kalyanaraman. 2009. "Notes for Matlab and Stata Regression Discontinuity
Software."
http://www.economics.harvard.edu/faculty/imbens/software_imbens
Imbens, Guido, and Karthik Kalyanaraman. 2009. "Optimal Bandwidth Choice for the Regression Discontinuity
Estimator." NBER Working Paper 14726.
Acknowledgements
I would like to thank Justin McCrary for helpful discussions. Any errors are my own.
The optimal bandwidth calculations are from Fuji, Imbens, and Kalyanaraman (2009), available at
http://www.economics.harvard.edu/faculty/imbens/software_imbens.
Citation of rd
rd is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it
as such:
Nichols, Austin. 2011. rd 2.0: Revised Stata module for regression discontinuity estimation.
http://ideas.repec.org/c/boc/bocode/s456888.html
Author
Austin Nichols
Urban Institute
Washington, DC, USA
austinnichols@gmail.com
Also see
Manual: [U] 23 Estimation and post-estimation commands
[R] bootstrap
[R] lpoly in Stata 10, else locpoly (findit locpoly to install)
[R] ivregress in Stata 10, else [R] ivreg
[R] regress
[XT] xtreg
On-line: help for (if installed) rd_obs (prior version of rd), ivreg2, overid, ivendog, ivhettest, ivreset, xtivreg2,
xtoverid, ranktest, condivreg; psmatch2, nnmatch.