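The outcome is a binary variable. Individual-level covariates include education, employment, and so on, and each observation carries a region id. I tried two approaches:
1) Enter region as dummy variables in an ordinary logistic regression. This analysis shows that regional differences exist.
2) Fit a two-level logistic regression. The Stata command is: "melogit y x1 x2... || citynum:,or", where citynum is the region id.
The two approaches give the following results:
Method 1: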
Logistic regression Number of obs = 1051
LR chi2(30) = 231.66
Prob > chi2 = 0.0000
Log likelihood = -581.91416 Pseudo R2 = 0.1660
| Variable | b (odds ratio) | ci95 |
| survey2 | 1.349 | [0.882,2.064] |
| urban | 5.781*** | [2.827,11.824] |
| turban | 0.202*** | [0.098,0.417] |
| biragegr2 | 1.759 | [0.771,4.011] |
| biragegr3 | 1.827 | [0.797,4.189] |
| biragegr4 | 1.397 | [0.540,3.612] |
| pone | 1.417* | [0.950,2.111] |
| ncd | 0.831 | [0.369,1.869] |
| hanmajor | 0.456** | [0.249,0.837] |
| marital | 1.762 | [0.500,6.213] |
| edugr2 | 3.081** | [1.021,9.300] |
| edugr3 | 3.208* | [0.967,10.646] |
| edugr4 | 4.190** | [1.218,14.410] |
| job | 1.058 | [0.688,1.626] |
| incomegr2 | 1.408* | [0.953,2.082] |
| incomegr3 | 1.602** | [1.002,2.563] |
| incomegr4 | 1.418 | [0.891,2.256] |
| incomegr5 | 1.737* | [0.967,3.121] |
| cleanwater | 0.736 | [0.449,1.206] |
| hygtoilet | 0.863 | [0.610,1.221] |
| htimegr | 0.452*** | [0.296,0.688] |
| nocover | 0.589** | [0.374,0.926] |
| city2 | 4.229*** | [2.160,8.279] |
| city3 | 0.701 | [0.390,1.260] |
| city4 | 5.915*** | [1.735,20.164] |
| city5 | 2.155** | [1.137,4.082] |
| city6 | 0.225*** | [0.108,0.472] |
| city7 | 0.659 | [0.332,1.307] |
| city8 | 3.944*** | [2.065,7.532] |
| city9 | 3.060*** | [1.332,7.030] |
| Constant | 0.274 | [0.056,1.343] |
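As a quick sanity check on how the header numbers of Method 1 fit together, here is a minimal Python sketch. It assumes Stata's usual definitions (LR chi2 = 2 × (full log likelihood − null log likelihood), and McFadden's pseudo R2 = 1 − full/null). The variable names are mine, and the covariate counts are simply my count of the rows in the table above.

```python
# Values copied from the Method 1 output header.
ll_full = -581.91416   # log likelihood of the fitted model
lr_chi2 = 231.66       # LR chi2(30)

# LR chi2 = 2 * (ll_full - ll_null), so the null log likelihood follows directly.
ll_null = ll_full - lr_chi2 / 2

# McFadden's pseudo R2, assumed here to be what Stata reports as "Pseudo R2".
pseudo_r2 = 1 - ll_full / ll_null

# Degrees of freedom: 22 individual-level terms plus 8 city dummies (city2..city9),
# counted from the table rows.
n_covariates = 22
n_city_dummies = 8
df = n_covariates + n_city_dummies

print(f"null log likelihood = {ll_null:.5f}")
print(f"pseudo R2           = {pseudo_r2:.4f}")
print(f"model df            = {df}")
```

The recovered pseudo R2 rounds to the reported 0.1660, and the 30 degrees of freedom match LR chi2(30), so the 8 city dummies are part of what this test evaluates.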
Method 2:
Mixed-effects logistic regression Number of obs = 1051
Group variable: citynum Number of groups= 9
Obs per group: min = 28
avg = 116.8
max = 192
Integration method: mvaghermite Integration points = 7
Wald chi2(22) = 84.73
Log likelihood = -599.58982 Prob > chi2 = 0.0000
| Variable | b (odds ratio) | ci95 |
| survey2 | 1.333 | [0.874,2.034] |
| urban | 5.610*** | [2.784,11.305] |
| turban | 0.218*** | [0.106,0.446] |
| biragegr2 | 1.772 | [0.778,4.037] |
| biragegr3 | 1.832 | [0.800,4.196] |
| biragegr4 | 1.414 | [0.548,3.645] |
| pone | 1.405* | [0.945,2.089] |
| ncd | 0.834 | [0.373,1.867] |
| hanmajor | 0.447*** | [0.244,0.816] |
| marital | 1.719 | [0.490,6.030] |
| edugr2 | 3.051** | [1.013,9.188] |
| edugr3 | 3.161* | [0.956,10.457] |
| edugr4 | 4.019** | [1.173,13.771] |
| job | 1.057 | [0.689,1.622] |
| incomegr2 | 1.392* | [0.944,2.052] |
| incomegr3 | 1.568* | [0.983,2.501] |
| incomegr4 | 1.406 | [0.886,2.232] |
| incomegr5 | 1.741* | [0.972,3.120] |
| cleanwater | 0.754 | [0.463,1.229] |
| hygtoilet | 0.85 | [0.602,1.201] |
| htimegr | 0.458*** | [0.302,0.694] |
| nocover | 0.582** | [0.371,0.912] |
| Constant | 0.441 | [0.083,2.353] |
| var(_cons[citynum]) | 0.945095 | |
| Constant | 2.573* | [0.976,6.781] |
LR test vs. logistic regression: chibar2(01) = 104.93 Prob>=chibar2 = 0.0000
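To make the random-effects part of the Method 2 output easier to read, here is a rough Python sketch of three calculations based on the reported numbers. It assumes the latent-variable definition of the intraclass correlation for a random-intercept logit model, var / (var + π²/3), and it assumes that chibar2(01) is a 50:50 mixture of a point mass at zero and a chi-square with 1 degree of freedom. The variable names are mine.

```python
import math

# Reported random-intercept variance, var(_cons[citynum]).
var_u = 0.945095

# Latent-variable ICC: the level-1 residual variance of a logit model is pi^2 / 3.
icc = var_u / (var_u + math.pi ** 2 / 3)

# Observation, not a claim about how the table was produced: exponentiating the
# variance gives a value close to the second "Constant" row (2.573), which may
# mean the export step applied the odds-ratio transform to that row as well.
exp_var = math.exp(var_u)

# p-value for the boundary LR test: half the upper tail of chi2(1).
# For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
chibar2 = 104.93
p_value = 0.5 * math.erfc(math.sqrt(chibar2 / 2))

print(f"ICC          = {icc:.3f}")
print(f"exp(var_u)   = {exp_var:.3f}")
print(f"LR test p    = {p_value:.3e}")
```

Under these assumptions the ICC is about 0.22, meaning roughly a fifth of the latent-scale variation sits between regions, and the LR test p-value is far below any conventional threshold, consistent with the reported 0.0000.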
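Questions:
1) The two approaches give fairly similar results. Is the main difference that Method 1 treats the region effects as fixed, whereas Method 2 controls for region as a random effect?
2) Based on these results, how exactly should I decide which approach is more appropriate?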
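One common, though by no means decisive, way to put the two fits side by side is an information criterion. The Python sketch below computes AIC = 2k − 2·logL from the two reported log likelihoods. The parameter counts are my own reading of the tables (an assumption): 30 slopes plus a constant for Method 1, and 22 slopes plus a constant plus one random-intercept variance for Method 2. The helper `aic` is hypothetical, written only for this illustration.

```python
def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 * log likelihood."""
    return 2 * n_params - 2 * log_lik

# Method 1: logistic regression with city dummies.
# Assumed k = 30 slopes + 1 constant.
aic_fixed = aic(-581.91416, 31)

# Method 2: random-intercept logistic regression.
# Assumed k = 22 slopes + 1 constant + 1 variance component.
aic_random = aic(-599.58982, 24)

print(f"AIC, region dummies   = {aic_fixed:.2f}")
print(f"AIC, random intercept = {aic_random:.2f}")
print(f"difference            = {aic_random - aic_fixed:.2f}")
```

A lower AIC favours the model on this criterion alone. It says nothing about the substantive considerations, such as whether the 9 regions are of interest in themselves or are viewed as a sample from a larger population of regions, and whether region-level covariates are needed, which only the random-intercept specification can accommodate.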