viking1111 posted on 2010-1-31 19:24
5# bobguy
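Haha, I can tell you really are an expert in this area; that is very professional.
Here is the situation: the sample was screened from all stocks on China's A-share market according to the relevant national standards. Only 259 small and medium-sized enterprises meet the requirements, and all of them are included here, so there is no sampling issue.
Of the 259 firms, 90 are ST and 169 are normal companies; this ratio should be the natural proportion (自然配比). The problem is that my goal is to predict corporate default risk. I want to regress whether a firm was designated ST in 2009 on data from 2005-2008, so that the resulting formula can be used to predict default probabilities in future years. So of the 90 ST firms, I can only use the 29 that were designated ST in 2009, and to preserve the natural proportion, the sample of normal companies has to be reduced accordingly.
So I would like to know whether there is some econometric method that lets me use all 169 normal companies and reduce the loss of information. Over the past few days I have come up with two approaches. The first is to use the bootstrap to repeatedly resample the firms designated ST in 2009. The second is to add the firms designated ST in 2008 to the dependent variable, with the corresponding independent variables shifted earlier from 2005-2008 to 2004-2007, which would double the sample size. The problem is that I want to use panel data, so I worry that the second approach is not appropriate. But the first is not ideal either. I would still appreciate your guidance.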
259 data points with goods = 169 and bads = 90 should be enough to do your analysis. Since you have every data point for all the small and medium-sized enterprises (全部中小企业), this IS population data. There is no sampling here, so the bootstrap is useless. There is also no need to have a 1:1 ratio of bads to goods (自然配比 = 1). There are many logistic analyses in which the research interest is only the treatment effect rather than the probability of being a bad. When the ratio is 1:1, the only biased estimator is the constant term, and hence the predicted probability of being bad/good (this is only true under the logistic assumptions). This can easily be shown with a small simulation program. The theoretical proof can be found on page 90 of Limited-Dependent and Qualitative Variables in Econometrics by Maddala. The proof is actually easy and involves only middle-school math.
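A minimal sketch of such a simulation, under assumptions that are not in the post: the data are synthetic, there is a single covariate, the true intercept is -3 and the true slope is 1, and the logistic fit is a hand-rolled Newton-Raphson routine in NumPy (`fit_logit` is a hypothetical helper, not a library function). It keeps every event, draws an equal number of nonevents, and compares the two fits. Under the logistic model the slope should be roughly unchanged, while the intercept should shift by about log(n0/n1), where n0 and n1 are the numbers of nonevents and events in the full data.

```python
import numpy as np

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Logistic regression by Newton-Raphson; X must contain a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        grad = X.T @ (y - p)                         # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # information matrix
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic population (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 200_000
true_intercept, true_slope = -3.0, 1.0
x = rng.normal(size=n)
prob = 1.0 / (1.0 + np.exp(-(true_intercept + true_slope * x)))
y = (rng.random(n) < prob).astype(float)
X = np.column_stack([np.ones(n), x])

# Fit on all data.
beta_full = fit_logit(X, y)

# 1:1 sample: keep all events, draw the same number of nonevents at random.
events = np.flatnonzero(y == 1)
nonevents = np.flatnonzero(y == 0)
kept = rng.choice(nonevents, size=events.size, replace=False)
idx = np.concatenate([events, kept])
beta_matched = fit_logit(X[idx], y[idx])

# Expected intercept shift caused by undersampling the nonevents.
offset = np.log(nonevents.size / events.size)
corrected_intercept = beta_matched[0] - offset

print("full data      :", beta_full)
print("1:1 sample     :", beta_matched)
print("corrected const:", corrected_intercept)
```

Subtracting `offset` from the constant of the 1:1 fit should bring the predicted probabilities back to the population scale, which is why the 1:1 design costs nothing when only the slope (the treatment effect) is of interest.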
The points in favor of a 1:1 ratio (自然配比 = 1), plus one caveat, are:
1) It reduces cost, especially in medical research.
2) The efficiency loss is much smaller than with a random sample, because events contain more information than nonevents when the number of events is much smaller than the number of nonevents.
3) It is still less efficient than using all the data. This is simply true because a sample with a 1:1 ratio is a sub-sample of all the data.
These points do not matter in your case.
I would suggest that you:
1) build a model with all the data
2) build a model with data up to 2008
Contrasting these two models may give you some hints about your models.
You may need to pay more attention to data quality than to data quantity, which you can do nothing about.
I hope this sheds some light on your research project.