Title:  Computing the Chow statistic
Author: William Gould, StataCorp
Date:   January 1999; minor revisions July 2005
Let’s start with the Chow test to which many refer. Consider the model
    y = a + b*x1 + c*x2 + u

and say that we have two groups of data. We could estimate that model on the two groups separately:
    y = a1 + b1*x1 + c1*x2 + u     for group == 1
    y = a2 + b2*x1 + c2*x2 + u     for group == 2

and we could estimate a single, pooled regression
    y = a + b*x1 + c*x2 + u        for both groups

In the last regression, we are asserting that a1==a2, b1==b2, and c1==c2. The formula for the “Chow test” of this constraint is
           ess_c - (ess_1+ess_2)
           ---------------------
                     k
    F  =   ---------------------
              ess_1 + ess_2
             ----------------
             N_1 + N_2 - 2*k

and this is the formula to which people refer. ess_1 and ess_2 are the error sums of squares from the separate regressions, ess_c is the error sum of squares from the pooled (constrained) regression, k is the number of estimated parameters (k=3 in our case), and N_1 and N_2 are the numbers of observations in the two groups.
The resulting test statistic is distributed F(k, N_1+N_2-2*k).
Let’s try this. I have created small datasets:
    clear
    set obs 100
    set seed 1234
    generate x1 = uniform()
    generate x2 = uniform()
    generate y = 4*x1 - 2*x2 + 2*invnormal(uniform())
    generate group = 1
    save one, replace

    clear
    set obs 80
    generate x1 = uniform()
    generate x2 = uniform()
    generate y = -2*x1 + 3*x2 + 8*invnormal(uniform())
    generate group = 2
    save two, replace

    use one, clear
    append using two
    save combined, replace

The models are different in the two groups, the residual variances are different, and so are the numbers of observations. With this dataset, I can carry out the Chow test. First, I run the separate regressions:
. regress y x1 x2 if group==1

  Source |       SS       df       MS              Number of obs =     100
---------+------------------------------           F(  2,    97) =   36.10
   Model |  328.686307     2  164.343154           Prob > F      =  0.0000
Residual |  441.589627    97  4.55247038           R-squared     =  0.4267
---------+------------------------------           Adj R-squared =  0.4149
   Total |  770.275934    99  7.78056499           Root MSE      =  2.1337

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   5.121087    .728493     7.03   0.000      3.67523    6.566944
      x2 |  -3.227026   .7388209    -4.37   0.000    -4.693381   -1.760671
   _cons |  -.1725655   .5698273    -0.30   0.763    -1.303515    .9583839
------------------------------------------------------------------------------

. regress y x1 x2 if group==2

  Source |       SS       df       MS              Number of obs =      80
---------+------------------------------           F(  2,    77) =    5.02
   Model |   544.11726     2   272.05863           Prob > F      =  0.0089
Residual |  4169.24211    77  54.1460014           R-squared     =  0.1154
---------+------------------------------           Adj R-squared =  0.0925
   Total |  4713.35937    79  59.6627768           Root MSE      =  7.3584

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   -1.21464     2.9578    -0.41   0.682    -7.104372    4.675092
      x2 |    8.49714   2.688249     3.16   0.002     3.144152    13.85013
   _cons |    -2.2591    1.91076    -1.18   0.241     -6.06391    1.545709
------------------------------------------------------------------------------

and then I run the combined regression:
. regress y x1 x2

  Source |       SS       df       MS              Number of obs =     180
---------+------------------------------           F(  2,   177) =    2.93
   Model |  176.150454     2  88.0752272           Prob > F      =  0.0559
Residual |  5316.21341   177   30.035104           R-squared     =  0.0321
---------+------------------------------           Adj R-squared =  0.0211
   Total |  5492.36386   179   30.683597           Root MSE      =  5.4804

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   2.692373    1.41842     1.90   0.059    -.1068176    5.491563
      x2 |   2.061004   1.370448     1.50   0.134    -.6435156    4.765524
   _cons |  -1.380331   1.017322    -1.36   0.177    -3.387973      .62731
------------------------------------------------------------------------------

For the Chow test,
           ess_c - (ess_1+ess_2)
           ---------------------
                     k
    F  =   ---------------------
              ess_1 + ess_2
             ----------------
             N_1 + N_2 - 2*k

here are the relevant numbers copied from the output above:
    ess_c = 5316.21341    (from combined regression)
    ess_1 = 441.589627    (from group==1 regression)
    ess_2 = 4169.24211    (from group==2 regression)
    k     = 3             (we estimate 3 parameters)
    N_1   = 100           (from group==1 regression)
    N_2   = 80            (from group==2 regression)

So, plugging in, we get
    5316.21341 - (441.589627+4169.24211)       705.38167
    ------------------------------------       ---------
                      3                            3
    ------------------------------------   =   ---------
          441.589627 + 4169.24211              4610.8317
          ------------------------             ---------
                100+80-2*3                        174

                                                235.12722
                                            =   ---------
                                                26.499033

                                            =   8.8730491

The Chow test is F(k, N_1+N_2-2*k) = F(3,174), so our test statistic is F(3,174) = 8.8730491.
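The same arithmetic can also be scripted rather than done by hand. Here is a minimal sketch (not part of the original FAQ) that assumes the error sums of squares and sample sizes have been copied from the output above; Stata's Ftail() returns the upper-tail p-value:

    scalar ess_c = 5316.21341
    scalar ess_1 = 441.589627
    scalar ess_2 = 4169.24211
    scalar k  = 3
    scalar N1 = 100
    scalar N2 = 80
    * Chow F statistic and its p-value
    scalar chowF = ((ess_c - (ess_1+ess_2))/k) / ((ess_1+ess_2)/(N1+N2-2*k))
    display "Chow F(3,174) = " chowF "   p-value = " Ftail(k, N1+N2-2*k, chowF)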
Now, I will do the same problem by running one regression and using test to test certain coefficients equal to zero. What I want to do is estimate the model
    y = a3 + b3*x1 + c3*x2 + a3'*g2 + b3'*g2*x1 + c3'*g2*x2 + u

where g2=1 if group==2 and g2=0 otherwise. I can do this by typing
. generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2
. regress y x1 x2 g2 g2x1 g2x2

Think about the predictions from this model. The model says
    y = a3 + b3*x1 + c3*x2 + u                       when g2==0
    y = (a3+a3') + (b3+b3')*x1 + (c3+c3')*x2 + u     when g2==1

Thus the model is equivalent to estimating the separate models
    y = a1 + b1*x1 + c1*x2 + u     for group == 1
    y = a2 + b2*x1 + c2*x2 + u     for group == 2

the relationship being
    a1 = a3        a2 = a3 + a3'
    b1 = b3        b2 = b3 + b3'
    c1 = c3        c2 = c3 + c3'

Some of you may be concerned that in the pooled model (the one estimating a3, b3, etc.), we are constraining the var(u) to be the same for each group, whereas, in the separate-equation model, we estimate different variances for group 1 and group 2. This does not matter, because the model is fully interacted. That is probably not convincing, but what should be convincing is that I am about to obtain the same F(3,174) = 8.87 answer and, in my concocted data, I have different variances in each group.
So, here is the result of the alternative approach: testing the coefficients against 0 in a pooled specification:
. generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2
. regress y x1 x2 g2 g2x1 g2x2

  Source |       SS       df       MS              Number of obs =     180
---------+------------------------------           F(  5,   174) =    6.65
   Model |  881.532123     5  176.306425           Prob > F      =  0.0000
Residual |  4610.83174   174   26.499033           R-squared     =  0.1605
---------+------------------------------           Adj R-squared =  0.1364
   Total |  5492.36386   179   30.683597           Root MSE      =  5.1477

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   5.121087   1.757587     2.91   0.004     1.652152    8.590021
      x2 |  -3.227026   1.782504    -1.81   0.072    -6.745139    .2910877
      g2 |  -2.086535   1.917507    -1.09   0.278    -5.871102    1.698032
    g2x1 |  -6.335727   2.714897    -2.33   0.021     -11.6941   -.9773583
    g2x2 |   11.72417    2.59115     4.52   0.000     6.610035     16.8383
   _cons |  -.1725655   1.374785    -0.13   0.900    -2.885966    2.540835
------------------------------------------------------------------------------

. test g2 g2x1 g2x2

 ( 1)  g2 = 0
 ( 2)  g2x1 = 0
 ( 3)  g2x2 = 0

       F(  3,   174) =    8.87
            Prob > F =    0.0000

Same answer.
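As a side check (my addition, not part of the original FAQ), lincom can be used after the interacted regression to recover the group-2 coefficients implied by the relationships above; for example, the sum for x1 should reproduce the -1.21464 reported by the separate group==2 regression:

    . lincom x1 + g2x1      // group-2 coefficient on x1
    . lincom x2 + g2x2      // group-2 coefficient on x2
    . lincom _cons + g2     // group-2 intercept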
This definition of the “Chow test” is equivalent to pooling the data, estimating the fully interacted model, and then testing the group 2 coefficients against 0.
That is why I said, “Chow Test is a term I have heard used by economists in the context of testing a set of regression coefficients being equal to 0.”
Admittedly, that leaves a lot unsaid.
The issue of the variance of u being equal in the two groups is subtle, but I do not want that to get in the way of understanding that the Chow test is equivalent to the “pool the data, interact, and test” procedure. They are equivalent.
Concerning variances, the Chow test itself is testing against a pooled, uninteracted model and so has buried in it an assumption of equal variances. It is really a test that the coefficients are equal and variance(u) in the groups are equal. It is, however, a weak test of the equality of variances because that assumption manifests itself only in how the pooled coefficient estimates are manufactured. Since the Chow test and the “pool the data, interact, and test” procedure are the same, the same is true of both procedures.
Your second concern might be that in the “pool the data, interact, and test” procedure there is an extra assumption of equality of variances because everything comes from the pooled model. As shown, that is not true. It is not true because the model is fully interacted and so the assumption of equal variances never makes a difference in the calculation of the coefficients.
http://www.stata.com/support/faqs/stat/chow2.html
Title:  Chow and Wald tests
Author: William Gould, StataCorp
Date:   July 1999; minor revision August 2007
First, see the FAQ How can I compute a Chow test statistic? The point of that FAQ is that you can do Chow tests using Stata’s test command and, in fact, Chow tests are what the test command reports.
Well, that’s not exactly right. test uses the estimated variance–covariance matrix of the estimators, and test performs Wald tests,
    W = (Rb - r)' (RVR')^(-1) (Rb - r)
where V is the estimated variance–covariance matrix of the estimators.
For linear regression with the conventionally estimated V, the Wald test is the Chow test and vice versa.
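As an illustration of the formula (my addition, not from the FAQ), the Wald statistic for the fully interacted regression from the first FAQ can be computed directly from e(b) and e(V) with Stata's matrix commands; with the conventional V left behind by regress, W divided by the number of constraints reproduces the F(3,174) = 8.87 result:

    . quietly regress y x1 x2 g2 g2x1 g2x2
    . matrix b = e(b)'                     // coefficient column vector: x1 x2 g2 g2x1 g2x2 _cons
    . matrix V = e(V)                      // estimated variance-covariance matrix of the estimators
    . matrix R = (0,0,1,0,0,0 \ 0,0,0,1,0,0 \ 0,0,0,0,1,0)    // picks g2, g2x1, g2x2; r = 0
    . matrix W = (R*b)' * invsym(R*V*R') * (R*b)
    . display "W = " W[1,1] "   F(3,174) = " W[1,1]/3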
You might say that you are performing a Chow test, but I say that you are performing a Wald test. That distinction is important, because the Wald test generalizes to different variance estimates of V, whereas the Chow test does not. After regress, vce(robust), for instance, test uses the V matrix estimated by the robust method because that is what regress, vce(robust) left behind.
Thus the short answer is that you estimate your model using regress, vce(robust) and then use Stata’s test command. You then call the result a Wald test.
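Concretely, reusing the interacted variables built in the first FAQ (a sketch, not the FAQ's own output):

    . regress y x1 x2 g2 g2x1 g2x2, vce(robust)
    . test g2 g2x1 g2x2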
If you are bothered that a Wald test produces F rather than chi-squared statistics, also see the FAQ Why does test sometimes produce chi-squared and other times F statistics?
Title:  Chow tests
Author: William Gould, StataCorp
Date:   January 2002; updated August 2005
In the past, I have always given in and cast my answer in Chow-test terms. In this reply, I try a different approach and, I think, the result is more useful.
This reply concerns linear regression (though the technique is really more general than that), and I gloss over the detail of pooling the residuals and whether the residual variances are really the same. For the last, I think I can be forgiven.
Here is what I wrote:
Is a Chow test the correct test to determine whether data can be pooled together?
History: In the days when statistical packages were not as sophisticated as they are now, testing whether coefficients were equal was not so easy. You had to write your own program, typically in FORTRAN. Chow showed a way you could perform the test based on statistics that were commonly reported, and that would produce the same result as if you performed the Wald test.
What does it mean “whether data can be pooled together”? Do you often meet nonprofessionals who say to you, “I was wondering whether the data could be pooled?” Forget that phrase, too: it is another piece of jargon for testing whether the behavior is the same, as measured by whether the coefficients are the same.
Let’s pretend that you have some model and two or more groups of data. Your model predicts something about the behavior within the group based on certain characteristics that vary within the group. Under the assumption that each group's behavior is unique, you have
    y_1 = X_1*b_1 + u_1     (equation for group 1)
    y_2 = X_2*b_2 + u_2     (equation for group 2)

and so on. Now, you want to test whether the behavior for one group is the same as for another, which means you want to test
    b_1 = b_2 = ...

How do you do that? Testing coefficients across separately estimated models is difficult to impossible, depending on things we need not go into right now. A trick is to “pool” the data to convert the multiple equations into one giant equation:
    y = d1*(X_1*b1 + u1) + d2*(X_2*b2 + u2) + ...

where y is the set of all outcomes (y_1, y_2, ...), and d1 is a variable that is 1 when the data are for group 1 and 0 otherwise, d2 is 1 when the data are for group 2 and 0 otherwise, ....
Notice that from the above I can retrieve the original equations. Setting d1=1 and d2=d3=...=0, I get the equation for group 1; setting d1=0 and d2=1 and d3=...=0, I get the equation for group 2; and so on.
Now, let’s start with
    y = d1*(X_1*b1 + u1) + d2*(X_2*b2 + u2) + ...

and rewrite it by a little algebraic manipulation:
    y = d1*(X_1*b1 + u1) + d2*(X_2*b2 + u2) + ...
      = d1*X_1*b1 + d1*u1 + d2*X_2*b2 + d2*u2 + ...
      = d1*X_1*b1 + d2*X_2*b2 + ... + d1*u1 + d2*u2 + ...
      = X_1*d1*b1 + X_2*d2*b2 + ... + d1*u1 + d2*u2 + ...
      = (X_1*d1)*b1 + (X_2*d2)*b2 + ... + d1*u1 + d2*u2 + ...

By stacking the data, I can get back estimates of b1, b2, ...
I include not X_1 in my model, but X_1*d1 (a set of variables equal to X_1 when group is 1 and 0 otherwise); I include not X_2 in my model, but X_2*d2 (a set of variables equal to X_2 when group is 2 and 0 otherwise); and so on.
Let’s use the auto dataset and pretend that I have two groups.
. sysuse auto, clear
. generate group1 = rep78==3
. generate group2 = group1==0

I could fit the models separately:
. regress price mpg weight if group1==1

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  2,    27) =   16.20
       Model |   196545318     2  98272658.8           Prob > F      =  0.0000
    Residual |   163826398    27  6067644.36           R-squared     =  0.5454
-------------+------------------------------           Adj R-squared =  0.5117
       Total |   360371715    29  12426610.9           Root MSE      =  2463.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   13.14912   184.5661     0.07   0.944    -365.5492    391.8474
      weight |   3.517687   1.015855     3.46   0.002     1.433324      5.60205
       _cons |  -5431.147   6599.898    -0.82   0.418    -18973.02    8110.725
------------------------------------------------------------------------------

. regress price mpg weight if group2==1

      Source |       SS       df       MS              Number of obs =      44
-------------+------------------------------           F(  2,    41) =    5.16
       Model |  54562909.6     2  27281454.8           Prob > F      =  0.0100
    Residual |   216614915    41  5283290.61           R-squared     =  0.2012
-------------+------------------------------           Adj R-squared =  0.1622
       Total |   271177825    43  6306461.04           Root MSE      =  2298.5

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -170.5474    93.3656    -1.83   0.075     -359.103      18.0083
      weight |   .0527381   .8064713     0.07   0.948    -1.575964      1.68144
       _cons |   9685.028   4190.693     2.31   0.026     1221.752      18148.3
------------------------------------------------------------------------------

I could fit the combined model:
. generate mpg1 = mpg*group1
. generate weight1 = weight*group1
. generate mpg2 = mpg*group2
. generate weight2 = weight*group2
. regress price group1 mpg1 weight1 group2 mpg2 weight2, noconstant

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  6,    68) =   91.38
       Model |  3.0674e+09     6   511232168           Prob > F      =  0.0000
    Residual |   380441313    68  5594725.19           R-squared     =  0.8897
-------------+------------------------------           Adj R-squared =  0.8799
       Total |  3.4478e+09    74  46592355.7           Root MSE      =  2365.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      group1 |  -5431.147   6337.479    -0.86   0.394    -18077.39    7215.096
        mpg1 |   13.14912   177.2275     0.07   0.941    -340.5029    366.8012
     weight1 |   3.517687   .9754638     3.61   0.001     1.571179    5.464194
      group2 |   9685.028   4312.439     2.25   0.028      1079.69    18290.37
        mpg2 |  -170.5474   96.07802    -1.78   0.080    -362.2681    21.17334
     weight2 |   .0527381   .8299005     0.06   0.950    -1.603303    1.708779
------------------------------------------------------------------------------

What is this noconstant option? We must remember that when we fit the separate models, each has its own intercept. There was an intercept in X_1, X_2, and so on. What I have done above is literally translate
    y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2

and so included the variables group1 and group2 (variables equal to 1 for their respective groups) and told Stata to omit the overall intercept.
I do not recommend you fit the model the way I have just illustrated because of numerical concerns—we'll get to that later. Fit the models separately or jointly, and you will get the same estimates for b_1 and b_2.
Now we can test whether the coefficients are the same for the two groups:
. test _b[mpg1]=_b[mpg2], notest

 ( 1)  mpg1 - mpg2 = 0

. test _b[weight1]=_b[weight2], accum

 ( 1)  mpg1 - mpg2 = 0
 ( 2)  weight1 - weight2 = 0

       F(  2,    68) =    5.61
            Prob > F =    0.0056

That is the Chow test. Something was omitted: the intercept. If we really wanted to test whether the two groups were the same, we would test
. test _b[mpg1]=_b[mpg2]

 ( 1)  mpg1 - mpg2 = 0

. test _b[weight1]=_b[weight2], accum

 ( 1)  mpg1 - mpg2 = 0
 ( 2)  weight1 - weight2 = 0

. test _b[group1]=_b[group2], accum

 ( 1)  mpg1 - mpg2 = 0
 ( 2)  weight1 - weight2 = 0
 ( 3)  group1 - group2 = 0

       F(  3,    68) =    4.07
            Prob > F =    0.0102

Using this approach, however, we are not tied down by what the "Chow test" can test. We can formulate any hypothesis we want. We might think that weight works the same way in both groups but that mpg works differently, and each group has its own intercept. Then, we could test
. test _b[mpg1]=_b[mpg2]

 ( 1)  mpg1 - mpg2 = 0

       F(  1,    68) =    0.83
            Prob > F =    0.3654

by itself. If we had more variables, we could test any subset of variables.
Is “pooling the data” justified? Of course it is: we just established that pooling the data is just another way of fitting separate models and that fitting separate models is certainly justified—we got the same coefficients. That’s why I told you to forget the phrase about whether pooling the data is justified. People who ask that don’t really mean to ask what they are saying: they mean to ask whether the coefficients are the same. In that case, they should say that. Pooling is always justified, and it corresponds to nothing more than the mathematical trick of writing separate equations,
    y_1 = X_1*b_1 + u_1     (equation for group 1)
    y_2 = X_2*b_2 + u_2     (equation for group 2)

as one equation
    y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2

There are many ways I can write the above equation, and I want to write it a little differently because of numerical concerns. Starting with
    y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2

let’s do some algebra to obtain
    y = X*b1 + d2*X_2*(b2-b1) + d1*u1 + d2*u2

where X = (X_1, X_2). In this formulation, I measure not b1 and b2, but b1 and (b2-b1). This is numerically more stable, and I can still test that b2==b1 by testing whether (b2-b1)==0. Let’s fit this model:
. regress price mpg weight mpg2 weight2 group2

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  5,    68) =    9.10
       Model |   254624083     5  50924816.7           Prob > F      =  0.0000
    Residual |   380441313    68  5594725.19           R-squared     =  0.4009
-------------+------------------------------           Adj R-squared =  0.3569
       Total |   635065396    73  8699525.97           Root MSE      =  2365.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   13.14912   177.2275     0.07   0.941    -340.5029    366.8012
      weight |   3.517687   .9754638     3.61   0.001     1.571179    5.464194
        mpg2 |  -183.6965   201.5951    -0.91   0.365    -585.9733    218.5803
     weight2 |  -3.464949   1.280728    -2.71   0.009    -6.020602   -.9092956
      group2 |   15116.17   7665.557     1.97   0.053    -180.2075    30412.56
       _cons |  -5431.147   6337.479    -0.86   0.394    -18077.39    7215.096
------------------------------------------------------------------------------

and, if I want to test whether the coefficients are the same, I can do
. test _b[mpg2]=0

 ( 1)  mpg2 = 0

. test _b[weight2]=0, accum

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0

       F(  2,    68) =    5.61
            Prob > F =    0.0056

and that gives the same answer yet again. If I want to test whether *ALL* the coefficients are the same (including the intercept), I can use
. test _b[mpg2]=0, notest

 ( 1)  mpg2 = 0

. test _b[weight2]=0, accum notest

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0

. test _b[group2]=0, accum

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0
 ( 3)  group2 = 0

       F(  3,    68) =    4.07
            Prob > F =    0.0102

Just as before, I can test any subset.
Using this difference formulation, if I had three groups, I would start with

    y = (X_1*d1)*b1 + (X_2*d2)*b2 + (X_3*d3)*b3 + d1*u1 + d2*u2 + d3*u3

and rewrite it as

    y = X*b1 + (X_2*d2)*(b2-b1) + (X_3*d3)*(b3-b1) + d1*u1 + d2*u2 + d3*u3

Let’s create the group variables and fit this model:
. sysuse auto, clear
. generate group1 = rep78==3
. generate group2 = rep78==4
. generate group3 = (group1+group2)==0
. generate mpg1 = mpg*group1
. generate weight1 = weight*group1
. generate mpg2 = mpg*group2
. generate weight2 = weight*group2
. generate mpg3 = mpg*group3
. generate weight3 = weight*group3
. regress price mpg weight mpg2 weight2 group2 ///
>        mpg3 weight3 group3

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  8,    65) =    5.80
       Model |   264415585     8  33051948.1           Prob > F      =  0.0000
    Residual |   370649811    65  5702304.78           R-squared     =  0.4164
-------------+------------------------------           Adj R-squared =  0.3445
       Total |   635065396    73  8699525.97           Root MSE      =  2387.9

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   13.14912   178.9234     0.07   0.942    -344.1855    370.4837
      weight |   3.517687   .9847976     3.57   0.001      1.55091    5.484463
        mpg2 |   130.5261   336.6547     0.39   0.699    -541.8198     802.872
     weight2 |   -2.18337   1.837314    -1.19   0.239     -5.85274       1.486
      group2 |   4560.193   12222.22     0.37   0.710    -19849.27    28969.66
        mpg3 |  -194.1974   216.3459    -0.90   0.373      -626.27    237.8752
     weight3 |  -3.160952    1.73308    -1.82   0.073    -6.622152    .3002481
      group3 |   14556.66   9167.998     1.59   0.117    -3753.101    32866.41
       _cons |  -5431.147    6398.12    -0.85   0.399    -18209.07    7346.781
------------------------------------------------------------------------------

If I want to test whether the three groups were the same in the Wald-test sense, I can use
. test (_b[mpg2]=0) (_b[weight2]=0) (_b[group2]=0) /*
>      */ (_b[mpg3]=0) (_b[weight3]=0) (_b[group3]=0)

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0
 ( 3)  group2 = 0
 ( 4)  mpg3 = 0
 ( 5)  weight3 = 0
 ( 6)  group3 = 0

       F(  6,    65) =    2.28
            Prob > F =    0.0463

which I could more easily type as
. testparm mpg2 weight2 group2 mpg3 weight3 group3

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0
 ( 3)  group2 = 0
 ( 4)  mpg3 = 0
 ( 5)  weight3 = 0
 ( 6)  group3 = 0

       F(  6,    65) =    2.28
            Prob > F =    0.0463
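On a modern Stata (version 11 or later), the same interacted model and the same six-constraint test can be set up with factor-variable notation instead of hand-built interaction variables. This is a sketch of my own, not part of the FAQ, and it assumes a single categorical variable grp coded 1, 2, 3 built from the indicators above; it should reproduce the same joint test:

    . generate grp = cond(group1, 1, cond(group2, 2, 3))
    . regress price i.grp##c.mpg i.grp##c.weight
    . testparm i.grp i.grp#c.mpg i.grp#c.weight        // 6 constraints, as above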
Title:  Pooling data and performing Chow tests in linear regression
Author: William Gould, StataCorp
Date:   December 1999; updated August 2005
Consider the model

    y = a + b*x1 + c*x2 + u

and let us pretend that we have two groups of data, group=1 and group=2. We could have more groups; everything said below generalizes to more than two groups.
We could estimate the models separately by typing
    . regress y x1 x2 if group==1

and

    . regress y x1 x2 if group==2

or we could pool the data and estimate a single model, one way being
    . gen g2 = (group==2)
    . gen g2x1 = g2*x1
    . gen g2x2 = g2*x2
    . regress y x1 x2 g2 g2x1 g2x2

The difference between these two approaches is that we are constraining the variance of the residual to be the same in the two groups when we pool the data. When we estimated separately, we estimated

    y = a1 + b1*x1 + c1*x2 + u1,   Var(u1) estimated from group=1 alone
    y = a2 + b2*x1 + c2*x2 + u2,   Var(u2) estimated from group=2 alone

When we pooled the data, we estimated

    y = a + b*x1 + c*x2 + a'*g2 + b'*g2x1 + c'*g2x2 + u,   one common Var(u)

If we evaluate this equation for the groups separately, we obtain

    y = a + b*x1 + c*x2 + u                         for group=1
    y = (a+a') + (b+b')*x1 + (c+c')*x2 + u          for group=2

with the same Var(u) in both. The difference is that we have now constrained the variance of u for group=1 to be the same as the variance of u for group=2.
If you perform this experiment with real data, you will observe the following:
If u is known to have the same variance in the two groups, the standard errors obtained from the pooled regression are better—they are more efficient. If the variances really are different, however, then the standard errors obtained from the pooled regression are wrong.
I have created a dataset (containing made-up data) on y, x1, and x2. The dataset has 74 observations for group=1 and another 71 observations for group=2. Using these data, I can run the regressions separately by typing
    [1] . regress y x1 x2 if group==1
    [2] . regress y x1 x2 if group==2

or I can run the pooled model by typing
        . gen g2 = (group==2)
        . gen g2x1 = g2*x1
        . gen g2x2 = g2*x2
    [3] . regress y x1 x2 g2 g2x1 g2x2

I did that in Stata; let me summarize the results. When I typed command [1], I obtained the following results (standard errors in parentheses):
    y = -8.650993 + 1.21329*x1 + -.8809939*x2 + u,        Var(u) = 15.8912
       (22.73703)  (.54459)      (.405401)

and when I ran command [2], I obtained
    y = 4.646994 + .9307004*x1 + .8812369*x2 + u,         Var(u) = 7.56852
       (11.1593)   (.236696)     (.1997562)

When I ran command [3], I obtained
    y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
       (17.92853)   (.42942)     (.3196656)

        13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u,     Var(u) = 12.5312
       (25.74446)     (.6123452)       (.459958)

The intercept and coefficients on x1 and x2 in [3] are the same as in [1], but the standard errors are different. Also, if I sum the appropriate coefficients in [3], I obtain the same results as [2]:
    Intercept:   13.29779 + -8.650993 = 4.646797     ([2] has 4.646994)
    x1:         -.2825893 + 1.21329   = .9307004     ([2] has .9307004)
    x2:          1.762231 + -.8809939 = .8812371     ([2] has .8812369)

The coefficients are the same, estimated either way. (The fact that the coefficients in [3] are a little off from those in [2] is just because I did not write down enough digits.)
The standard errors for the coefficients are different.
I also wrote down the estimated Var(u), what is reported as RMSE in Stata’s regression output. In standard deviation terms, u has s.d. 15.891 in group=1, 7.5685 in group=2, and if we constrain these two very different numbers to be the same, the pooled s.d. is 12.531.
To relax the assumption that the variances are equal, we start exactly the same way:

    . gen g2 = (group==2)
    . gen g2x1 = g2*x1
    . gen g2x2 = g2*x2
    . regress y x1 x2 g2 g2x1 g2x2

To that, we add
        . predict r, resid
        . sum r if group==1
        . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
        . sum r if group==2
        . replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
    [4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]

In the above, the constant 3 appears twice because three coefficients are being estimated in each group (an intercept, a coefficient for x1, and a coefficient for x2). If there were a different number of coefficients being estimated, that number would change.
In any case, this will reproduce exactly the standard errors reported by estimating the two models separately. The advantage is that we can now test equality of coefficients between the two equations. For instance, we can now read right off the pooled regression results whether the effect of x1 is the same in groups 1 and 2 (answer: is _b[g2x1]==0?, because _b[x1] is the effect in group 1 and _b[x1]+_b[g2x1] is the effect in group 2, so the difference is _b[g2x1]). And, using test, we can test other constraints as well.
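For example (a sketch of my own, after running [4]), the question posed in the parentheses above — is the effect of x1 the same in the two groups? — is just

    . test g2x1            // H0: effect of x1 is the same in both groups
    . test g2x1 g2x2       // H0: effects of x1 and x2 are both the same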
For instance, if you wanted to prove to yourself that the results of [4] are the same as typing regress y x1 x2 if group==2, you could type
    . test x1 + g2x1 == 0      (reproduces test of x1 for group==2)

and

    . test x2 + g2x2 == 0      (reproduces test of x2 for group==2)
Using the made-up data, I did exactly that. To recap, first I estimated separate regressions:
    [1] . regress y x1 x2 if group==1
    [2] . regress y x1 x2 if group==2

and then I ran the variance-constrained regression,
        . gen g2 = (group==2)
        . gen g2x1 = g2*x1
        . gen g2x2 = g2*x2
    [3] . regress y x1 x2 g2 g2x1 g2x2

and then I ran the variance-unconstrained regression,
        . predict r, resid
        . sum r if group==1
        . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
        . sum r if group==2
        . replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
    [4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]

Just to remind you, here is what commands [1] and [2] reported:
    y = -8.650993 + 1.21329*x1 + -.8809939*x2 + u,        Var(u) = 15.8912
       (22.73703)  (.54459)      (.405401)

    y = 4.646994 + .9307004*x1 + .8812369*x2 + u,         Var(u) = 7.56852
       (11.1593)   (.236696)     (.1997562)

Here is what command [4] reported:
    y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
       (22.73703)  (.54459)      (.405401)

        13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u
       (25.3269)      (.6050657)       (.451943)

Those results are the same as [1] and [2]. (Pay no attention to the RMSE reported by regress at this last step; the reported RMSE is the standard deviation of neither of the two groups but is instead a weighted average; see the FAQ on this if you care. If you want to know the standard deviations of the respective residuals, look back at the output from the summarize statements typed when producing the weighting variable.)
Technical Note: Note that in creating the weights, we typed

    . sum r if group==1
    . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1

and similarly for group 2. The 3 that appears in the finite-sample normalization factor (r(N)-1)/(r(N)-3) appears because there are three coefficients per group being estimated. If our model had fewer or more coefficients, that number would change. In fact, the finite-sample normalization factor changes results very little. In real work, I would have ignored it and just typed

    . sum r if group==1
    . gen w = r(Var) if group==1

unless the number of observations in one of the groups was very small. The normalization factor was included here so that [4] would produce the same results as [1] and [2].
If, after fitting the variance-unconstrained model

    [4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]

we test whether group 2 is the same as group 1, we obtain
. test g2x1 g2x2 g2

 ( 1)  g2x1 = 0.0
 ( 2)  g2x2 = 0.0
 ( 3)  g2 = 0.0

       F(  3,   139) =  307.50
            Prob > F =   0.0000

If instead we had constrained the variances to be the same, estimating the model using
    [3] . regress y x1 x2 g2 g2x1 g2x2

and then repeated the test, the reported F-statistic would be 300.81.
If there were more groups, and the variance differences were great among the groups, this could become more important.
Anyway, an alternative to the weighted OLS approach [4] is xtgls, panels(het). To use it, you pool the data just as always,
    . gen g2 = (group==2)
    . gen g2x1 = g2*x1
    . gen g2x2 = g2*x2

and then type
    [5] . xtgls y x1 x2 g2 g2x1 g2x2, panels(het) i(group)

to estimate the model. The result of doing that with my fictional data is
    y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
       (22.27137)  (.53344)      (.397099)

        13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u
       (24.80488)     (.5925734)       (.442610)

These are the same coefficients we have always seen.
The standard errors produced by xtgls, panels(het) here are about 2% smaller than those produced by [4] and in general will be a little smaller because xtgls, panels(het) is an asymptotically based estimator. The two estimators are asymptotically equivalent, however, and in fact quickly become identical. The only caution I would advise is not to use xtgls, panels(het) if the number of degrees of freedom (observations minus number of coefficients) is below 25 in any of the groups. Then, the weighted OLS approach [4] is better (and you should make the finite-sample adjustment described in the above technical note).
* Chow test by hand: fit the two group regressions and the pooled regression,
* then compute the Chow statistic's p-value from the error sums of squares.
* y is the dependent variable, x1-xn the regressors, g the group variable.
reg y x1-xn if g==1
scalar r1 = e(rss)
scalar n1 = e(N)
reg y x1-xn if g==2
scalar r2 = e(rss)
scalar n2 = e(N)
reg y x1-xn if g==1 | g==2
scalar r = e(rss)
scalar k = e(df_m) + 1
* the statistic is distributed F(k, n1+n2-2k)
di 1 - F(k, n1+n2-2*k, (r-r1-r2)*(n1+n2-2*k)/((r1+r2)*k))
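For concreteness, a hypothetical run of the same do-file on the auto dataset (my illustration, not the poster's): price as y, mpg and weight as the regressors, and foreign as the two-valued group variable (coded 0/1 rather than 1/2, which does not affect the formula):

    sysuse auto, clear
    reg price mpg weight if foreign==0
    scalar r1 = e(rss)
    scalar n1 = e(N)
    reg price mpg weight if foreign==1
    scalar r2 = e(rss)
    scalar n2 = e(N)
    reg price mpg weight
    scalar r = e(rss)
    scalar k = e(df_m) + 1
    di 1 - F(k, n1+n2-2*k, (r-r1-r2)*(n1+n2-2*k)/((r1+r2)*k))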
Hehe, moderator Lan is as good as ever. I will learn from you and keep working at it steadily.
You are too modest.
I just go and look at the Stata website, that is all.
This thread is very good; it has helped me understand the Chow test better.
By the way, may I ask everyone: Stata has a ready-made chow command (I saw it earlier in arlionn's post).
In the help file for the chow command, the example is:
chow y x1 x2, chow(year>1975)
My data are for 2005 and 2006, so the command I typed was chow y x1 x2, chow(year>2005)
But the Results window shows: "program error: code follows on the same line as open brace"
Could someone tell me what is going on here?
* Suppose the dependent variable is y, the regressors are x1-xn, and the group
* variable g takes two values (g=1, 2). Joint F test on x1-xn and the
* intercept (the null hypothesis is that all coefficients are equal across the
* two groups).
qui g grp = (g==1)
foreach i of var x1-xn {
    qui g `i'a = `i'*(g==1)
    qui g `i'b = `i'*(g==2)
}
* grp is created before the interaction variables, so the varlist range must
* run over the interactions (x1a ... xnb) with grp added separately
qui qreg y x1a-xnb grp
foreach i of var x1-xn {
    qui test _b[`i'a]=_b[`i'b], a
}
n test _b[grp]=0, a
* One difficulty with the quantile-regression version of the test: Stata does
* not seem to allow quantile regression without a constant.
* Replace qreg with reg and this becomes the ordinary Chow test for OLS
* regression.
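For concreteness, a hypothetical run of the same idea on the auto data used earlier (price on mpg and weight, two groups built from rep78); the group coding below is my assumption for illustration, not part of the original post:

    sysuse auto, clear
    generate g = cond(rep78==3, 1, 2)
    generate grp = (g==1)
    foreach i of var mpg weight {
        generate `i'a = `i'*(g==1)
        generate `i'b = `i'*(g==2)
    }
    qreg price mpga-weightb grp
    test _b[mpga]=_b[mpgb], notest
    test _b[weighta]=_b[weightb], accum notest
    test _b[grp]=0, accum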