Question: How do I run the Heckman command in Stata?

2005-09-05

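For example: I want to analyze whether farm households engage in land rental, which I treat as two decisions. The first is whether to rent land in at all, indicated by rentin; the second is how much land to rent in, measured by landin. The explanatory variables are age, edu, land, and so on. How do I run this with the Heckman command in Stata? In Stata 8.0 I went to Statistics, found Selection models, and chose Heckman selection model (two-step). I do not understand what "Selection DV" means: does the box in front of it have to be ticked? How do the variables entered under "Selection independent variables" differ from those under "Independent variables"? Should they be exactly the same? Should landin go in "Dependent variable"? Also, what is the difference between Heckman selection model (ML) and Heckman selection model (two-step)? I have just started learning Stata and would appreciate any guidance. Thank you.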


All replies

蓝色

2005-9-5 12:57:00

Look at the help for heckman. It is very detailed and includes examples.

-------------------------------------------------------------------------------------------
help for heckman                           manual:  [R] heckman
                                          dialogs:  heckman ml  heckman 2-step  predict
-------------------------------------------------------------------------------------------

Heckman selection model

Basic syntax:

      heckman depvar [varlist], select(varlist_s) [twostep]

   or

      heckman depvar [varlist], select(depvar_s = varlist_s) [twostep]

Full syntax for maximum-likelihood estimates only:

      heckman depvar [varlist] [weight] [if exp] [in range], select([depvar_s =]
            varlist_s [, offset(varname) noconstant]) [ robust cluster(varname)
            score(newvarlist|stub*) nshazard(newvarname) mills(newvarname)
            offset(varname) noconstant constraints(numlist) first noskip level(#)
            iterate(0) nolog maximize_options ]

Full syntax for Heckman's two-step consistent estimates only:

      heckman depvar [varlist] [if exp] [in range], twostep select([depvar_s =]
            varlist_s [, noconstant]) [ nshazard(newvarname) mills(newvarname)
            noconstant first level(#) [ rhosigma | rhotrunc | rholimited | rhoforce]
            ]

by ... : may be used with heckman; see help by.

pweights, aweights, fweights, and iweights are allowed; see help weights.  No weights
are allowed if twostep is specified.

heckman shares the features of all estimation commands; see help estcom.

The syntax of predict following heckman is

      predict [type] newvarname [if exp] [in range] [, statistic nooffset]

where statistic is

      xb            fitted values for regression equation; the default
      ycond         E(y | y observed)
      yexpected     E(y*), y taken to be 0 where unobserved
      nshazard      nonselection hazard or inverse Mills' ratio
      mills         nonselection hazard or inverse Mills' ratio
      psel          P(y observed)
      xbsel         linear prediction for selection equation
      stdpsel       standard error of selection linear pred.
      pr(a,b)       Pr(y | a<y<b)
      e(a,b)        E(y | a<y<b)
      ystar(a,b)    E(y*), y* = max(a,min(y,b))
      stdp          standard error of the prediction
      stdf          standard error of the forecast

where a and b may be numbers or variables; a missing (a >= .) means -infinity; and b
missing (b >= .) means infinity.

These statistics are available both in and out of sample; type "predict ... if
e(sample) ..." if wanted only for the estimation sample.

Description

heckman fits regression models with selection using either Heckman's two-step
consistent estimator or full maximum-likelihood.

heckman estimates all of the parameters in the model:

      (regression equation: y is depvar, x is varlist)
      y = xb + u_1

      (selection equation: Z is varlist_s)
      y observed if Zg + u_2 > 0

      where:
            u_1 ~ N(0, sigma)
            u_2 ~ N(0, 1)
            corr(u_1, u_2) = rho
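
The reason the two equations have to be modelled together can be checked numerically. The Python sketch below is not part of the Stata help; it simulates the errors of the model above for a single value of Zg and compares the average of u_1 among selected observations with the textbook value rho * sigma * phi(Zg) / Phi(Zg). It assumes sigma denotes the standard deviation of u_1, and the function names are made up for this illustration.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulated_mean_u1_given_selected(zg, sigma, rho, n=200_000, seed=12345):
    """Average of u_1 over draws where Zg + u_2 > 0 (y is observed)."""
    rng = random.Random(seed)
    total, selected = 0.0, 0
    for _ in range(n):
        u2 = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Build u_1 with sd sigma and correlation rho with u_2.
        u1 = sigma * (rho * u2 + math.sqrt(1.0 - rho * rho) * noise)
        if zg + u2 > 0:
            total += u1
            selected += 1
    return total / selected


def theoretical_mean_u1_given_selected(zg, sigma, rho):
    """E(u_1 | Zg + u_2 > 0) = rho * sigma * phi(Zg) / Phi(Zg)."""
    return rho * sigma * normal_pdf(zg) / normal_cdf(zg)


if __name__ == "__main__":
    print(simulated_mean_u1_given_selected(0.5, 2.0, 0.6))
    print(theoretical_mean_u1_given_selected(0.5, 2.0, 0.6))
```

When rho is not zero the selected errors do not average to zero, which is why a plain regression on the observed sample is biased and why the nonselection hazard enters the regression equation.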

In the syntax for heckman, depvar and varlist are the dependent variable and
regressors for the underlying regression model (y = xb), and varlist_s are the
variables (Z) thought to determine whether depvar is selected/observed or unobserved.
By default, heckman will assume that missing values (see help missing) of depvar
imply that the dependent variable is unobserved (not selected).  With some datasets
it is more convenient to specify a binary variable (depvar_s) that identifies the
observations for which the dependent is observed/selected (depvar_s!=0) or not
observed (depvar_s==0); heckman will accommodate either type of data.

See help svyheckman for a survey version of heckman.

Options

select(...) specifies the variables and options for the selection equation.  It is an
      integral part of specifying a Heckman model and is not optional.

twostep specifies that Heckman's (1979) two-step efficient estimates of the
      parameters and covariance matrix (standard errors) of the model are to be
      produced.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in
      place of the traditional calculation.  robust combined with cluster() further
      allows observations which are not independent within cluster (although they must
      be independent between clusters).  See [U] 23.14 Obtaining robust variance
      estimates.

cluster(varname) specifies that the observations are independent across groups
      (clusters) but not necessarily independent within groups.  varname specifies to
      which group each observation belongs; e.g., cluster(personid) in data with
      repeated observations on individuals.  cluster() can be used with pweights to
      produce estimates for unstratified cluster-sampled data.  Specifying cluster()
      implies robust.

score(newvarlist|stub*) creates new variables containing the contributions to the
      scores for each equation and ancillary parameter in the model; see [U] 23.15
      Obtaining scores.

      If score(newvarlist) is specified, four new variables must be provided.  If
      score(stub*) is specified, then variables stub1, stub2, stub3, and stub4 will be
      created.

      The first new variable will contain d(ln L_j)/d(x_j beta)
      The second, d(ln L_j)/d(z_j gamma)
      The third, d(ln L_j)/d(atanh(rho))
      The fourth, d(ln L_j)/d(ln(sigma))

nshazard(varname) and mills(varname) are synonyms, and either creates a new variable
      containing the nonselection hazard (what is often referred to as the inverse of
      the Mills' ratio) from the selection equation.  With the options twostep or
      iterate(0), the nonselection hazard is derived from a probit regression of
      whether the dependent variable is selected/observed.  Under full
      maximum-likelihood, the nonselection hazard is derived from the parameter
      estimates of the selection equation.

offset(varname) is a rarely used option that specifies a variable to be added
      directly to xb.  This option may be specified on either the regression or
      selection equation.

noconstant omits the constant term from the equations.  This option may be specified
      on either the regression equation or the selection equation.

constraints(numlist) specifies the linear constraints to be applied during
      estimation.  Constraints are defined using the constraint command and are
      numbered; see help constraint.  The default is to perform unconstrained
      estimation.  constraints() may not be specified with twostep.

first specifies that the first-step probit estimates of the selection equation be
      displayed prior to estimation.

noskip specifies that a full maximum-likelihood model with only a constant for the
      regression equation be fitted.  This model is not displayed but is used as the
      base model to compute a likelihood-ratio test for the model test statistic
      displayed in the estimation header.  By default, the overall model test statistic
      is an asymptotically equivalent Wald test of all the parameters in the regression
      equation being zero (except the constant).  For many models, this option can
      significantly increase estimation time.

level(#) specifies the confidence level in percent for confidence intervals of the
      coefficients; see help level.

iterate(0) produces Heckman's (1979) two-step parameter estimates with standard
      errors computed from the inverse Hessian of the full information matrix at the
      two-step solution for the parameters.  As an alternative, the twostep option
      computes Heckman's two-step consistent estimates of the standard errors.
      iterate(#) can also be used to restrict the maximum number of iterations during
      optimization; see help maximize.

rhosigma, rhotrunc, rholimited, and rhoforce are rarely used options to specify how
      the two-step estimator, option twostep, handles unusual cases where the two-step
      estimate of rho is outside the admissible range for a correlation, [-1,1].  When
      rho is outside this range it is possible for the two-step estimate of the
      coefficient variance-covariance matrix to not be positive definite and thus
      unusable for testing.  The default is rhosigma.

      rhotrunc specifies that rho be truncated to lie in the range [-1,1].  If the
      two-step estimate is below -1, rho is set to -1; if the two-step estimate is
      above 1, rho is set to 1.  This truncated value of rho is used in all
      computations to estimate the two-step covariance matrix.

      rhosigma specifies that rho be truncated, as with option rhotrunc, and that the
      estimate of sigma be made consistent with rho_hat, the truncated estimate of rho.
      So, sigma_hat = B_m * rho_hat; see the Methods and Formulas section of [R]
      heckman for the definition of B_m.  Both the truncated rho and the new estimate
      of sigma_hat are used in all computations to estimate the two-step covariance
      matrix.

      rholimited specifies that rho be truncated only in the computation of the
      diagonal matrix D as it enters V_twostep and Q; see [R] heckman Methods and
      Formulas.  In all other computations, the untruncated estimate of rho is used.

      rhoforce specifies that the two-step estimate of rho be retained even if it is
      outside the admissible range for a correlation.  This may, in rare cases, lead to
      a nonpositive-definite covariance matrix.

      These options have no effect when estimation is by maximum likelihood, the
      default.  They also have no effect when the two-step estimate of rho is in the
      range [-1,1].
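
The truncation rules for rhotrunc, rhosigma, and rhoforce can be written out in a few lines. This Python sketch is an illustration, not Stata's code: it assumes the two-step estimate of rho is the coefficient on the Mills ratio (called beta_mills here, the B_m of the text) divided by the estimate of sigma, and it leaves out rholimited because that option only changes an intermediate matrix. All names are hypothetical.

```python
def truncate_rho(rho_raw):
    """Clamp a correlation estimate to the admissible range [-1, 1]."""
    return max(-1.0, min(1.0, rho_raw))


def apply_rho_option(beta_mills, sigma_raw, option="rhosigma"):
    """Return (rho_hat, sigma_hat) under one of the rho-handling options.

    Assumption for this sketch: rho_raw = beta_mills / sigma_raw.
    """
    rho_raw = beta_mills / sigma_raw
    if option == "rhoforce":
        # Keep the raw estimate even if it is outside [-1, 1].
        return rho_raw, sigma_raw
    rho_hat = truncate_rho(rho_raw)
    if option == "rhotrunc":
        # Truncate rho only; sigma is left as estimated.
        return rho_hat, sigma_raw
    if option == "rhosigma":
        # Truncate rho and, only when truncation happened, reset sigma
        # so that it is consistent with the truncated rho.
        if rho_hat != rho_raw:
            return rho_hat, beta_mills * rho_hat
        return rho_hat, sigma_raw
    raise ValueError("unknown option: " + option)


if __name__ == "__main__":
    for opt in ("rhosigma", "rhotrunc", "rhoforce"):
        print(opt, apply_rho_option(3.0, 2.0, opt))
```

When the raw estimate is already inside [-1,1], all three branches return the inputs unchanged, matching the statement that the options then have no effect.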

nolog suppresses the iteration log.

maximize_options control the maximization process; see help maximize.  You will
      likely never need to specify any of the maximize options except for iterate(0)
      and possibly difficult.  If the iteration log shows many "not concave" messages
      and it is taking many iterations to converge, you may want to try using the
      difficult option and see if that helps it to converge in fewer steps.

Options for predict

xb, the default, calculates the linear predictions from the underlying regression
      equation.

ycond calculates the expected value of the dependent variable conditional on the
      dependent variable being observed/selected; E(y | y observed).

yexpected calculates the expected value of the dependent variable (y*), where that
      value is taken to be 0 when it is expected to be unobserved; y* = P(y observed) *
      E(y | y observed).

      The assumption of 0 is valid for many cases where nonselection implies
      non-participation (e.g., unobserved wage levels, insurance claims from those who
      are uninsured, etc.) but may be inappropriate for some problems (e.g., unobserved
      disease incidence).
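
For readers who want to see how ycond and yexpected relate, here is a small Python sketch using the standard formulas for this model: E(y | y observed) = xb + rho * sigma * phi(zg) / Phi(zg), and P(y observed) = Phi(zg). It is a hand calculation under those textbook formulas, not a reproduction of Stata's internals, and the argument names (xb, zg, rho, sigma) are placeholders for fitted quantities.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ycond(xb, zg, rho, sigma):
    """E(y | y observed): linear prediction plus the selection correction."""
    hazard = normal_pdf(zg) / normal_cdf(zg)  # nonselection hazard
    return xb + rho * sigma * hazard


def yexpected(xb, zg, rho, sigma):
    """E(y*) = P(y observed) * E(y | y observed), y taken as 0 if unobserved."""
    return normal_cdf(zg) * ycond(xb, zg, rho, sigma)


if __name__ == "__main__":
    print(ycond(10.0, 0.0, 0.5, 2.0))
    print(yexpected(10.0, 0.0, 0.5, 2.0))
```

With rho = 0 the correction term vanishes and ycond equals xb, so yexpected is simply the selection probability times the linear prediction.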

nshazard and mills are synonyms, either calculates the nonselection hazard -- what is
      often referred to as the inverse of the Mills' ratio.

psel calculates the probability of selection (or being observed):  P(y observed) =
      Pr(z_j*g + u_2j > 0).

xbsel calculates the linear prediction for the selection equation.

stdpsel calculates the standard error of the linear prediction for the selection
      equation.

pr(a,b) calculates the Pr(a < x*b+u_1 < b), the probability that y|x would be
      observed in the interval (a,b).

      a and b may be specified as numbers or variable names;
      pr(20,30) calculates Pr(20 < x*b+u_1 < 30);
      pr(lb,ub) calculates Pr(lb < x*b+u_1 < ub); and
      pr(20,ub) calculates Pr(20 < x*b+u_1 < ub).

      a missing (a >= .) means minus infinity; pr(.,30) calculates Pr(x*b+u_1 < 30) and
      pr(lb,30) calculates Pr(x*b+u_1 < 30) in observations for which lb >= . (and
      calculates Pr(lb < x*b+u_1 < 30) elsewhere).

      b missing (b >= .) means plus infinity; pr(20,.) calculates Pr(x*b+u_1 > 20) and
      pr(20,ub) calculates Pr(x*b+u_1 > 20) in observations for which ub >= . (and
      calculates Pr(20 < x*b+u_1 < ub) elsewhere).
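
The handling of missing bounds can be mimicked in a short Python sketch. It assumes u_1 is normal with standard deviation sigma and uses None to play the role of a missing bound; the function name pr_interval is invented for this illustration.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pr_interval(xb, sigma, a=None, b=None):
    """Pr(a < xb + u_1 < b) with u_1 ~ N(0, sigma^2).

    a=None stands for minus infinity and b=None for plus infinity,
    mirroring how a missing bound is treated.
    """
    lower = 0.0 if a is None else normal_cdf((a - xb) / sigma)
    upper = 1.0 if b is None else normal_cdf((b - xb) / sigma)
    return upper - lower


if __name__ == "__main__":
    print(pr_interval(25.0, 5.0, 20.0, 30.0))  # like pr(20,30)
    print(pr_interval(25.0, 5.0, None, 30.0))  # like pr(.,30)
    print(pr_interval(25.0, 5.0, 20.0, None))  # like pr(20,.)
```

Applied observation by observation, this reproduces the mixed case in the text: where a bound variable is missing, that side of the interval is simply left open.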

e(a,b) calculates E(x*b+u_1 | a < x*b+u_1 < b), the expected value of y|x conditional
      on y|x being in the interval (a,b), which is to say, y|x is truncated.  a and b
      are specified as they are for pr().

ystar(a,b) calculates E(y*), where y* = a if x*b+u_1 < a, y* = b if x*b+u_1 > b, and
      y* = x*b+u_1 otherwise, which is to say, y* is censored.  a and b are specified as
      they are for pr().
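
The difference between e(a,b) and ystar(a,b) is easiest to see side by side. The Python sketch below uses the standard normal-theory formulas with finite bounds only; it assumes u_1 has standard deviation sigma, and the function names are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def e_interval(xb, sigma, a, b):
    """E(xb + u_1 | a < xb + u_1 < b): mean given y lies inside (a, b)."""
    alpha = (a - xb) / sigma
    beta = (b - xb) / sigma
    mass = normal_cdf(beta) - normal_cdf(alpha)
    return xb + sigma * (normal_pdf(alpha) - normal_pdf(beta)) / mass


def ystar_interval(xb, sigma, a, b):
    """E(y*), where y* = max(a, min(xb + u_1, b)): values outside are clamped."""
    alpha = (a - xb) / sigma
    beta = (b - xb) / sigma
    p_below = normal_cdf(alpha)        # mass moved to a
    p_above = 1.0 - normal_cdf(beta)   # mass moved to b
    p_inside = normal_cdf(beta) - normal_cdf(alpha)
    return a * p_below + b * p_above + p_inside * e_interval(xb, sigma, a, b)


if __name__ == "__main__":
    print(e_interval(0.0, 1.0, 0.0, 100.0))
    print(ystar_interval(0.0, 1.0, 0.0, 100.0))
```

e(a,b) discards the probability mass outside the interval, while ystar(a,b) keeps it and piles it up at the bounds, so the two agree only in special cases such as an interval symmetric around xb.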

stdp calculates the standard error of the prediction from the underlying regression
      equation.

stdf calculates the standard error of the forecast of the underlying regression
      equation.  This is often informally referred to as the standard error of the
      prediction.  By construction, the standard errors produced by stdf are always
      larger than those by stdp; see [R] regress.

nooffset is relevant only if you specified offset() for heckman.  It modifies the
      calculations made by predict so that they ignore the offset variable; the linear
      prediction is treated as xb rather than xb + offset.

Examples

To obtain full ML estimates:

      . heckman wage educ age, select(married children educ age)

To obtain Heckman's two-step consistent estimates:

      . heckman wage educ age, select(married children educ age) twostep

To define and use each equation separately:

      . global wage_eqn wage educ age
      . global seleqn married children age
      . heckman $wage_eqn, select($seleqn)

To use a variable to identify selection:

      . heckman wage educ age, select(wageseen = married children educ age)

To use options:

      . heckman wage educ age [pw=wgt], select(married children educ age)
      . heckman wage educ age, select(married children educ age) robust
      . heckman $wage_eqn, select($seleqn) cluster(county)
      . heckman $wage_eqn, select($seleqn) score(scr1 scr2 scr3 scr4)

      . heckman wage educ age, select(married children educ age) first

      . heckman $wage_eqn, select($seleqn) mills(mymills)

      . heckman wage educ age, noconstant select(married children educ age)
      . heckman wage educ age, select(married children educ age, noconstant)

Prediction:

      . heckman wage educ age, select(married children educ age)

      . predict yhat
      . predict yhat, xb

      . predict mystdp, stdp
      . predict mystdf, stdf

      . predict ycond, ycond
      . predict ystar, yexpected
      . predict probseen, psel
      . predict selindex, xbsel

      . predict mymill, mills

      . predict p0to20, pr(0,20)
      . predict less15, pr(.,15)
      . predict ey0to20, e(0,20)
      . predict ys0to20, ystar(0,20)

Also see

Manual:  [U] 23 Estimation and post-estimation commands,
         [U] 29 Overview of Stata estimation commands,
         [R] heckman

Online:  help for constraint, estcom, postest, svyheckman; heckprob, regress,
         svyheckprob, tobit, treatreg


hanszhu

2005-9-6 08:23:00

Hello,

I am trying to use the Heckman model. I found that many people mention that PROC QLIM can do it in SAS 9.1, but I checked the documentation for PROC QLIM and could not find any detailed information about Heckman model analysis. Can anyone give me more detailed information about using PROC QLIM for Heckman?

Thank you all for your help!


hanszhu

2005-9-6 08:24:00

I don't know about PROC QLIM, but a colleague has some IML code that does it. I think she found it on the web; if no one helps with QLIM, and you'd like the IML code, let me know and I will try to find it. Tangentially, the Heckman model has gotten very mixed reviews: economists seem to like it, others seem less impressed.

Peter L. Flom, PhD Assistant Director, Statistics and Data Analysis Core Center for Drug Use and HIV Research National Development and Research Institutes 71 W. 23rd St www.peterflom.com New York, NY 10010 (212) 845-4485 (voice) (917) 438-0894 (fax)


hanszhu

2005-9-6 08:25:00

If you go to

http://ftp.sas.com/techsup/download/stat/

you'll find a 'heckman' program. But it's not PROC QLIM. It uses Greene's correction to Heckman to get the adjusted standard errors. It uses PROC PROBIT, PROC REG, and *also* IML.

Peter mentioned that non-econ people tend not to be as thrilled with the Heckman model. So let me add another point. Heckman was *wrong*. He claimed that the OLS estimates would be smaller than the real standard errors. Greene (1981) showed that the OLS estimates could be larger or smaller or even the same size as the true errors.

What I find worrisome is the whole concept that you can really correct for an unknowable bias when using non-randomly selected samples. I simply don't believe this. You can make (possibly unwarranted) assumptions to model the bias, as Heckman did. But you can't evaluate those assumptions.

I also found this URL: http://www.stat.purdue.edu/~ywang/Introduction%20to%20Heckman%20Model.ppt It contains code that uses only PROC LOGISTIC, a data step, and PROC REG.

In the PROC QLIM examples, there is an entry for Sample Selection Models. As you may guess by reading my whiny diatribe above, this shows how to fit a Heckman-like model using PROC QLIM. If you look under the "Details" section, the "Selection Models" part shows the *exact* model to use to get the classical Heckman model.

HTH, David -- David L. Cassell mathematical statistician Design Pathways 3115 NW Norwood Pl. Corvallis OR 97330


hanszhu

2005-9-6 08:27:00

Hi,

You've already gotten some thought-provoking feedback about the Heckman two-step (and if I ever go to a barn dance, I'm going to shout out a request to dance the Heckman Two-Step) from two of the gurus. I have my own complaints about it (it seems that whenever the Mills Ratio variable is significant, it has a positive coefficient, although sometimes the higher propensity subjects have a smaller effect size).

If you do want to continue in your quest to dance Heckman with SAS, you have at least two alternatives at your disposal:

1) David Jaeger's macro (http://support.sas.com/ctx/samples/index.jsp? sid=476&re="s/y/PROCS/reg/reg/s/y/PROCS/reg/reg") This requires SAS/STAT and SAS/IML (and Base SAS, natch) to run the whole thing, but both modeling steps are done with STAT PROCs (PROBIT and REG), with a data step in between to extract the Mills ratios; the IML portion at the end is just to compute corrected standard errors.

2) PROC QLIM in SAS 9 (http://support.sas.com/onlinedoc/913/docMainpage.jsp). They consider the Heckman models part of the general class of selection models, which is true. So the details section of the PROC QLIM documentation on Selection Models has a bit of the Heckman theory, and there is an example of a Heckman-like model, "Example 22.4: Sample Selection Model" that you can look at. The code is pretty simple:

      proc qlim data=mroz;
         model inlf = nwifeinc educ exper expersq age kidslt6 kidsge6 / discrete;
         model lwage = educ exper expersq / select(inlf=1);
      run;

It's very straightforward to see what it's doing; the first model step is the binary discrete part, the second models the effect when the first dependent variable is 1. So give it a spin (and report back to us if you like, oh intrepid pioneer)!
