Look at the help for heckman. It is very detailed and includes examples.
-------------------------------------------------------------------------------------------
help for heckman                                        manual: [R] heckman
dialogs: heckman ml heckman 2-step predict
-------------------------------------------------------------------------------------------
Heckman selection model
Basic syntax:
heckman depvar [varlist], select(varlist_s) [twostep]
or
heckman depvar [varlist], select(depvar_s = varlist_s) [twostep]
Full syntax for maximum-likelihood estimates only:
heckman depvar [varlist] [weight] [if exp] [in range], select([depvar_s =]
varlist_s [, offset(varname) noconstant]) [ robust cluster(varname)
score(newvarlist|stub*) nshazard(newvarname) mills(newvarname)
offset(varname) noconstant constraints(numlist) first noskip level(#)
iterate(0) nolog maximize_options ]
Full syntax for Heckman's two-step consistent estimates only:
heckman depvar [varlist] [if exp] [in range], twostep select([depvar_s =]
varlist_s [, noconstant]) [ nshazard(newvarname) mills(newvarname)
noconstant first level(#) [ rhosigma | rhotrunc | rholimited | rhoforce]
]
by ... : may be used with heckman; see help by.
pweights, aweights, fweights, and iweights are allowed; see help weights. No weights
are allowed if twostep is specified.
heckman shares the features of all estimation commands; see help estcom.
The syntax of predict following heckman is
predict [type] newvarname [if exp] [in range] [, statistic nooffset]
where statistic is
    xb            fitted values for regression equation; the default
    ycond         E(y | y observed)
    yexpected     E(y*), y taken to be 0 where unobserved
    nshazard      nonselection hazard or inverse Mills' ratio
    mills         nonselection hazard or inverse Mills' ratio
    psel          P(y observed)
    xbsel         linear prediction for selection equation
    stdpsel       standard error of selection linear pred.
    pr(a,b)       Pr(y | a<y<b)
    e(a,b)        E(y | a<y<b)
    ystar(a,b)    E(y*), y* = max(a,min(y,b))
    stdp          standard error of the prediction
    stdf          standard error of the forecast
where a and b may be numbers or variables; a missing (a >= .) means -infinity; and b
missing (b >= .) means infinity.
These statistics are available both in and out of sample; type "predict ... if
e(sample) ..." if wanted only for the estimation sample.
Description
heckman fits regression models with selection using either Heckman's two-step
consistent estimator or full maximum-likelihood.
Heckman estimates all of the parameters in the model:
(regression equation: y is depvar, x is varlist)
y = xb + u_1
(selection equation: Z is varlist_s)
y observed if Zg + u_2 > 0
where:
u_1 ~ N(0, sigma)
u_2 ~ N(0, 1)
corr(u_1, u_2) = rho
In the syntax for heckman, depvar and varlist are the dependent variable and
regressors for the underlying regression model (y = xb), and varlist_s are the
variables (Z) thought to determine whether depvar is selected/observed or unobserved.
By default, heckman will assume that missing values (see help missing) of depvar
imply that the dependent variable is unobserved (not selected). With some datasets
it is more convenient to specify a binary variable (depvar_s) that identifies the
observations for which the dependent variable is observed/selected (depvar_s!=0) or
not observed (depvar_s==0); heckman will accommodate either type of data.
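The two ways of marking selection described above can be sketched on simulated data. Everything in this block is an assumption made for illustration: the variable names (x, z, seen, y), the coefficient values, and the error correlation of 0.5. It also assumes a Stata version that provides rnormal() and drawnorm.

```stata
* Simulated selection model; all names and parameter values are illustrative
clear
set seed 2718
set obs 5000
generate double x = rnormal()
generate double z = rnormal()
matrix C = (1, 0.5 \ 0.5, 1)
drawnorm u1 u2, corr(C)

* Latent outcome and selection rule: y is observed only when seen == 1
generate double ylatent = 1 + 2*x + u1
generate byte seen = (0.5 + z + 0.5*x + u2 > 0)
generate double y = ylatent if seen

* Form 1: missing values of y mark the unselected observations
heckman y x, select(z x)
scalar b_missing = _b[y:x]

* Form 2: an explicit indicator (depvar_s) marks selection
heckman y x, select(seen = z x)
scalar b_indicator = _b[y:x]

display "coefficient on x: " b_missing "  vs  " b_indicator
```

Because y is missing exactly where seen is 0, both commands describe the same data, so the two fits are expected to coincide.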
See help svyheckman for a survey version of heckman.
Options
select(...) specifies the variables and options for the selection equation. It is an
integral part of specifying a Heckman model and is not optional.
twostep specifies that Heckman's (1979) two-step efficient estimates of the
parameters and covariance matrix (standard errors) of the model are to be
produced.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in
place of the traditional calculation. robust combined with cluster() further
allows observations which are not independent within cluster (although they must
be independent between clusters). See [U] 23.14 Obtaining robust variance
estimates.
cluster(varname) specifies that the observations are independent across groups
(clusters) but not necessarily independent within groups. varname specifies to
which group each observation belongs; e.g., cluster(personid) in data with
repeated observations on individuals. cluster() can be used with pweights to
produce estimates for unstratified cluster-sampled data. Specifying cluster()
implies robust.
score(newvarlist|stub*) creates new variables containing the contributions to the
scores for each equation and ancillary parameter in the model; see [U] 23.15
Obtaining scores.
If score(newvarlist) is specified, four new variables must be provided. If
score(stub*) is specified, then variables stub1, stub2, stub3, and stub4 will be
created.
The first new variable will contain d(ln L_j)/d(x_j beta)
The second, d(ln L_j)/d(z_j gamma)
The third, d(ln L_j)/d(atanh(rho))
The fourth, d(ln L_j)/d(ln(sigma))
nshazard(varname) and mills(varname) are synonyms, and either creates a new variable
containing the nonselection hazard (what is often referred to as the inverse of
the Mills' ratio) from the selection equation. With the options twostep or
iterate(0), the nonselection hazard is derived from a probit regression of
whether the dependent variable is selected/observed. Under full
maximum-likelihood, the nonselection hazard is derived from the parameter
estimates of the selection equation.
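As a rough check of the two-step case described above, the nonselection hazard can be rebuilt by hand from a probit of the selection indicator, using the standard definition lambda = phi(Zg)/Phi(Zg). The data are simulated, and every variable name and parameter value is an assumption for illustration; normalden() and normal() are assumed to be available (older releases used different function names).

```stata
* Simulated selection model; all names and parameter values are illustrative
clear
set seed 2718
set obs 5000
generate double x = rnormal()
generate double z = rnormal()
matrix C = (1, 0.5 \ 0.5, 1)
drawnorm u1 u2, corr(C)
generate byte seen = (0.5 + z + 0.5*x + u2 > 0)
generate double y = 1 + 2*x + u1 if seen

* Hand computation: probit for selection, then density over distribution
probit seen z x
predict double zg, xb
generate double lambda = normalden(zg)/normal(zg)

* Same quantity as saved by the mills() option under twostep
heckman y x, select(seen = z x) twostep mills(m)

* The two variables should agree up to storage precision
summarize lambda m
```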
offset(varname) is a rarely used option that specifies a variable to be added
directly to xb. This option may be specified on either the regression or
selection equation.
noconstant omits the constant term from the equations. This option may be specified
on either the regression equation or the selection equation.
constraints(numlist) specifies the linear constraints to be applied during
estimation. Constraints are defined using the constraint command and are
numbered; see help constraint. The default is to perform unconstrained
estimation. constraints() may not be specified with twostep.
first specifies that the first-step probit estimates of the selection equation be
displayed prior to estimation.
noskip specifies that a full maximum-likelihood model with only a constant for the
regression equation be fitted. This model is not displayed but is used as the
base model to compute a likelihood-ratio test for the model test statistic
displayed in the estimation header. By default, the overall model test statistic
is an asymptotically equivalent Wald test of all the parameters in the regression
equation being zero (except the constant). For many models, this option can
significantly increase estimation time.
level(#) specifies the confidence level in percent for confidence intervals of the
coefficients; see help level.
iterate(0) produces Heckman's (1979) two-step parameter estimates with standard
errors computed from the inverse Hessian of the full information matrix at the
two-step solution for the parameters. As an alternative, the twostep option
computes Heckman's two-step consistent estimates of the standard errors.
iterate(#) can also be used to restrict the maximum number of iterations during
optimization; see help maximize.
rhosigma, rhotrunc, rholimited, and rhoforce are rarely used options to specify how
the two-step estimator, option twostep, handles unusual cases where the two-step
estimate of rho is outside the admissible range for a correlation, [-1,1]. When
rho is outside this range it is possible for the two-step estimate of the
coefficient variance-covariance matrix to not be positive definite and thus
unusable for testing. The default is rhosigma.
rhotrunc specifies that rho be truncated to lie in the range [-1,1]. If the
two-step estimate is below -1, rho is set to -1; if the two-step estimate is
above 1, rho is set to 1. This truncated value of rho is used in all
computations to estimate the two-step covariance matrix.
rhosigma specifies that rho be truncated, as with option rhotrunc, and that the
estimate of sigma be made consistent with rho_hat, the truncated estimate of rho.
So, sigma_hat = B_m * rho_hat; see the Methods and Formulas section of [R]
heckman for the definition of B_m. Both the truncated rho and the new estimate
of sigma_hat are used in all computations to estimate the two-step covariance
matrix.
rholimited specifies that rho be truncated only in the computation of the
diagonal matrix D as it enters V_twostep and Q; see [R] heckman Methods and
Formulas. In all other computations, the untruncated estimate of rho is used.
rhoforce specifies that the two-step estimate of rho be retained even if it is
outside the admissible range for a correlation. This may, in rare cases, lead to
a nonpositive-definite covariance matrix.
These options have no effect when estimation is by maximum likelihood, the
default. They also have no effect when the two-step estimate of rho is in the
range [-1,1].
nolog suppresses the iteration log.
maximize_options control the maximization process; see help maximize. You will
likely never need to specify any of the maximize options except for iterate(0)
and possibly difficult. If the iteration log shows many "not concave" messages
and it is taking many iterations to converge, you may want to try using the
difficult option and see if that helps it to converge in fewer steps.
Options for predict
xb, the default, calculates the linear predictions from the underlying regression
equation.
ycond calculates the expected value of the dependent variable conditional on the
dependent variable being observed/selected; E(y | y observed).
yexpected calculates the expected value of the dependent variable (y*), where that
value is taken to be 0 when it is expected to be unobserved; E(y*) = P(y observed) *
E(y | y observed).
The assumption of 0 is valid for many cases where nonselection implies
non-participation (e.g., unobserved wage levels, insurance claims from those who
are uninsured, etc.) but may be inappropriate for some problems (e.g., unobserved
disease incidence).
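The relationship between ycond, yexpected, psel, and the nonselection hazard can be checked numerically. This is a sketch on simulated data; the variable names and parameter values are assumptions. The hand formula for ycond, xb + rho*sigma*lambda, is the standard selection-model result and relies on heckman leaving rho and sigma in e(rho) and e(sigma) after maximum-likelihood estimation.

```stata
* Simulated selection model; all names and parameter values are illustrative
clear
set seed 2718
set obs 5000
generate double x = rnormal()
generate double z = rnormal()
matrix C = (1, 0.5 \ 0.5, 1)
drawnorm u1 u2, corr(C)
generate byte seen = (0.5 + z + 0.5*x + u2 > 0)
generate double y = 1 + 2*x + u1 if seen

heckman y x, select(seen = z x)

predict double xb_hat, xb
predict double lam, mills
predict double p, psel
predict double yc, ycond
predict double ye, yexpected

* E(y*) = P(y observed) * E(y | y observed)
generate double ye_byhand = p*yc

* E(y | y observed) = xb + rho*sigma*lambda
generate double yc_byhand = xb_hat + e(rho)*e(sigma)*lam

* Each pair should agree up to rounding
summarize ye ye_byhand yc yc_byhand
```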
nshazard and mills are synonyms; either calculates the nonselection hazard -- what is
often referred to as the inverse of the Mills' ratio.
psel calculates the probability of selection (or being observed): P(y observed) =
Pr(z_j*g + u_2j > 0).
xbsel calculates the linear prediction for the selection equation.
stdpsel calculates the standard error of the linear prediction for the selection
equation.
pr(a,b) calculates Pr(a < x*b+u_1 < b), the probability that y|x would be
observed in the interval (a,b).
a and b may be specified as numbers or variable names;
pr(20,30) calculates Pr(20 < x*b+u_1 < 30);
pr(lb,ub) calculates Pr(lb < x*b+u_1 < ub); and
pr(20,ub) calculates Pr(20 < x*b+u_1 < ub).
a missing (a >= .) means minus infinity; pr(.,30) calculates Pr(x*b+u_1 < 30) and
pr(lb,30) calculates Pr(x*b+u_1 < 30) in observations for which lb >= . (and
calculates Pr(lb < x*b+u_1 < 30) elsewhere).
b missing (b >= .) means plus infinity; pr(20,.) calculates Pr(x*b+u_1 > 20) and
pr(20,ub) calculates Pr(x*b+u_1 > 20) in observations for which ub >= . (and
calculates Pr(20 < x*b+u_1 < ub) elsewhere).
e(a,b) calculates E(x*b+u_1 | a < x*b+u_1 < b), the expected value of y|x conditional
on y|x being in the interval (a,b), which is to say, y|x is truncated. a and b
are specified as they are for pr().
ystar(a,b) calculates E(y*), where y* = a if x*b+u_1 < a, y* = b if x*b+u_1 > b, and
y* = x*b+u_1 otherwise, which is to say, y* is censored. a and b are specified as
they are for pr().
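The rules for numeric, variable, and missing limits, and the difference between the truncated mean e(a,b) and the censored mean ystar(a,b), can be illustrated as follows. The data are simulated, and the names (lb, p_fixed, and so on), the limits 0 and 3, and the choice of which observations get a missing lower limit are all assumptions. The final identity is the usual decomposition of a censored mean and is offered as a consistency check, not as a statement of how predict computes it.

```stata
* Simulated selection model; all names and parameter values are illustrative
clear
set seed 2718
set obs 5000
generate double x = rnormal()
generate double z = rnormal()
matrix C = (1, 0.5 \ 0.5, 1)
drawnorm u1 u2, corr(C)
generate byte seen = (0.5 + z + 0.5*x + u2 > 0)
generate double y = 1 + 2*x + u1 if seen

heckman y x, select(seen = z x)

* Lower limit held in a variable, missing in the first 100 observations
generate double lb = 0
replace lb = . in 1/100

predict double p_fixed, pr(0,3)
predict double p_var,   pr(lb,3)
predict double p_below, pr(.,3)
predict double p_above, pr(3,.)

* p_var should equal p_below where lb is missing and p_fixed elsewhere
summarize p_var p_below in 1/100
summarize p_var p_fixed in 101/l

* Truncated mean versus censored mean on (0,3)
predict double e03,  e(0,3)
predict double ys03, ystar(0,3)

* Censored mean = 0*Pr(y<0) + 3*Pr(y>3) + Pr(0<y<3)*E(y | 0<y<3)
generate double ys_byhand = 3*p_above + p_fixed*e03
summarize ys03 ys_byhand
```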
stdp calculates the standard error of the prediction from the underlying regression
equation.
stdf calculates the standard error of the forecast of the underlying regression
equation. This is often informally referred to as the standard error of the
prediction. By construction, the standard errors produced by stdf are always
larger than those by stdp; see [R] regress.
nooffset is relevant only if you specified offset() for heckman. It modifies the
calculations made by predict so that they ignore the offset variable; the linear
prediction is treated as xb rather than xb + offset.
Examples
To obtain full ML estimates:
. heckman wage educ age, select(married children educ age)
To obtain Heckman's two-step consistent estimates:
. heckman wage educ age, select(married children educ age) twostep
To define and use each equation separately:
. global wage_eqn wage educ age
. global seleqn married children age
. heckman $wage_eqn, select($seleqn)
To use a variable to identify selection:
. heckman wage educ age, select(wageseen = married children educ age)
To use options:
. heckman wage educ age [pw=wgt], select(married children educ age)
. heckman wage educ age, select(married children educ age) robust
. heckman $wage_eqn, select($seleqn) cluster(county)
. heckman $wage_eqn, select($seleqn) score(scr1 scr2 scr3 scr4)
. heckman wage educ age, select(married children educ age) first
. heckman $wage_eqn, select($seleqn) mills(mymills)
. heckman wage educ age, noconstant select(married children educ age)
. heckman wage educ age, select(married children educ age, noconstant)
Prediction:
. heckman wage educ age, select(married children educ age)
. predict yhat
. predict yhat, xb
. predict mystdp, stdp
. predict mystdf, stdf
. predict ycond, ycond
. predict ystar, yexpected
. predict probseen, psel
. predict selindex, xbsel
. predict mymill, mills
. predict p0to20, pr(0,20)
. predict less15, pr(.,15)
. predict ey0to20, e(0,20)
. predict ys0to20, ystar(0,20)
Also see
Manual: [U] 23 Estimation and post-estimation commands,
[U] 29 Overview of Stata estimation commands,
[R] heckman
Online: help for constraint, estcom, postest, svyheckman; heckprob, regress,
svyheckprob, tobit, treatreg