Stata Learning Module on Regression Diagnostics: Heteroscedasticity
Please Note: Stata graph commands changed with version 8. This page was developed before version 8 was released and uses Stata 7 graph commands. Please see How do I use version 7 graph commands in Stata version 8? for information on how to either run these Stata 7 graph commands in Stata version 8 or convert them to Stata 8 syntax.
This module explores regression diagnostics for data that are heteroscedastic, that is, data whose errors have non-constant variance across the predicted values of y. We will use a file called hetsc.dat to illustrate these problems. The file contains 100 observations and the variables case, y, x1, x2, x3, and x4. We will use x1, x2, and x3 as predictors and y as the dependent variable. Below we use the hetsc.dta file.
. use http://www.ats.ucla.edu/stat/stata/modules/reg/hetsc, clear
We try running a regression predicting y from x1 x2 and x3.
. regress y x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   65.68
   Model |  8933.72373     3  2977.90791               Prob > F      =  0.0000
Residual |  4352.46627    96  45.3381903               R-squared     =  0.6724
---------+------------------------------               Adj R-squared =  0.6622
   Total |    13286.19    99  134.203939               Root MSE      =  6.7334

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .2158539    .083724      2.578   0.011       .0496631    .3820447
      x2 |   .7559357    .086744      8.715   0.000       .5837503    .9281211
      x3 |   .3732164   .0591071      6.314   0.000       .2558898     .490543
   _cons |   33.23969   .6758811     49.180   0.000       31.89807     34.5813
------------------------------------------------------------------------------
We can use the hettest command to test for heteroscedasticity. The test rejects the null hypothesis of constant variance, indicating that the residuals are indeed heteroscedastic, so we need to understand this problem further and try to address it.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of y
     Ho: Constant variance
         chi2(1)      =     21.30
         Prob > chi2  =      0.0000
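To see what this test measures, the statistic can be rebuilt by hand. The sketch below assumes the Cook-Weisberg (Breusch-Pagan) score form of the test: squared residuals are scaled by RSS/N, regressed on the fitted values, and half of that regression's model sum of squares is referred to a chi-square distribution with 1 degree of freedom. The names yhat_chk, res_chk, g_chk, bp_chi2 and bp_p are ours, chosen only for illustration. In current versions of Stata the test itself is run with estat hettest after regress.

```stata
* Assumes the hetsc data and the model used in this module.
use http://www.ats.ucla.edu/stat/stata/modules/reg/hetsc, clear
regress y x1 x2 x3

* Fitted values and residuals from the regression above.
predict double yhat_chk, xb
predict double res_chk, residuals

* Scale the squared residuals by the variance estimate RSS/N.
generate double g_chk = res_chk^2 / (e(rss)/e(N))

* Auxiliary regression of the scaled squared residuals on the fitted values.
quietly regress g_chk yhat_chk

* Score statistic: half the model sum of squares, with 1 degree of freedom.
scalar bp_chi2 = e(mss)/2
scalar bp_p    = chi2tail(1, bp_chi2)
display "chi2(1) = " %6.2f bp_chi2 "   Prob > chi2 = " %6.4f bp_p
```

If the assumption about the form of the test holds, the displayed values should agree with the hettest output above. A large statistic means the size of the squared residuals changes systematically with the fitted values.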
Looking at the rvfplot below that shows the residual by fitted (predicted) value, we can clearly see evidence for heteroscedasticity. The variability of the residuals at the left side of the graph is much smaller than the variability of the residuals at the right side of the graph.
. rvfplot
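The same picture can be built by hand, which also gives us a rough number to go with the visual impression. The sketch below uses Stata 8 or later graph syntax (twoway scatter), and then compares the standard deviation of the residuals in the lower and upper halves of the fitted values. The variable names fit_chk, res_chk and upper_half are ours, and the median split is an arbitrary choice made only for illustration.

```stata
* Assumes the hetsc data and the model used in this module.
use http://www.ats.ucla.edu/stat/stata/modules/reg/hetsc, clear
quietly regress y x1 x2 x3

* Fitted values and residuals.
predict double fit_chk, xb
predict double res_chk, residuals

* Residual-versus-fitted plot (Stata 8 or later graph syntax).
twoway scatter res_chk fit_chk, yline(0) ytitle("Residuals") xtitle("Fitted values")

* Compare residual spread below and above the median fitted value.
quietly summarize fit_chk, detail
generate byte upper_half = fit_chk > r(p50)
tabstat res_chk, by(upper_half) statistics(sd n)
```

A noticeably larger standard deviation in the upper half would match the fan shape described above.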
We will try to stabilize the variance by using a square root transformation, and then run the regression again.
. generate sqy = y^.5
. regress sqy x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   69.37
   Model |  66.0040132     3  22.0013377               Prob > F      =  0.0000
Residual |  30.4489829    96  .317176905               R-squared     =  0.6843
---------+------------------------------               Adj R-squared =  0.6744
   Total |  96.4529961    99  .974272688               Root MSE      =  .56318

------------------------------------------------------------------------------
     sqy |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0170293   .0070027      2.432   0.017        .003129    .0309297
      x2 |   .0652379   .0072553      8.992   0.000       .0508362    .0796397
      x3 |   .0328274   .0049438      6.640   0.000       .0230141    .0426407
   _cons |   5.682593   .0565313    100.521   0.000       5.570379    5.794807
------------------------------------------------------------------------------
Running hettest again, the chi-square value has dropped from 21.30 to 13.06, but the test for heteroscedasticity is still highly significant (p = 0.0003). The square root transformation was not successful.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of sqy
     Ho: Constant variance
         chi2(1)      =     13.06
         Prob > chi2  =      0.0003
The rvfplot below confirms that the residuals are still heteroscedastic.
. rvfplot
We next try a natural log transformation, and run the regression.
. generate lny = ln(y)
. regress lny x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   69.85
   Model |  8.17710164     3  2.72570055               Prob > F      =  0.0000
Residual |  3.74606877    96   .03902155               R-squared     =  0.6858
---------+------------------------------               Adj R-squared =  0.6760
   Total |  11.9231704    99  .120436065               Root MSE      =  .19754

------------------------------------------------------------------------------
     lny |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0054677   .0024562      2.226   0.028       .0005921    .0103432
      x2 |   .0230303   .0025448      9.050   0.000       .0179788    .0280817
      x3 |   .0118223    .001734      6.818   0.000       .0083803    .0152643
   _cons |   3.445503   .0198285    173.765   0.000       3.406144    3.484862
------------------------------------------------------------------------------
We again run hettest. The chi-square value falls to 5.60, so the results are much improved, but the test is still significant at the .05 level (p = 0.0179).
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of lny
     Ho: Constant variance
         chi2(1)      =      5.60
         Prob > chi2  =      0.0179
Below we see that the rvfplot does not look perfect, but it is much improved.
. rvfplot
You might also want to try a log (base 10) transformation. We show that below.
. generate log10y = log10(y)
. regress log10y x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   69.85
   Model |  1.54229722     3  .514099074               Prob > F      =  0.0000
Residual |  .706552237    96  .007359919               R-squared     =  0.6858
---------+------------------------------               Adj R-squared =  0.6760
   Total |  2.24884946    99  .022715651               Root MSE      =  .08579

------------------------------------------------------------------------------
  log10y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0023746   .0010667      2.226   0.028       .0002571     .004492
      x2 |   .0100019   .0011052      9.050   0.000       .0078081    .0121957
      x3 |   .0051344   .0007531      6.818   0.000       .0036395    .0066292
   _cons |   1.496363   .0086114    173.765   0.000       1.479269    1.513456
------------------------------------------------------------------------------
The hettest results are identical to those for the natural log. Whether we choose a log to the base e or a log to the base 10, the reduction in heteroscedasticity (as measured by hettest) is the same, because log10(y) is simply ln(y) divided by the constant ln(10).
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of log10y
     Ho: Constant variance
         chi2(1)      =      5.60
         Prob > chi2  =      0.0179
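The agreement between the two log models is not a coincidence. Since log10(y) = ln(y)/ln(10), every coefficient and standard error in the log10 model is the natural-log value divided by ln(10) (about 2.3026), so the t statistics, F, R-squared and the heteroscedasticity test are unchanged. The sketch below checks this for x1. The names lny_chk, log10y_chk, b_ln, t_ln, b_log10 and t_log10 are ours, used only for illustration.

```stata
* Assumes the hetsc data used in this module.
use http://www.ats.ucla.edu/stat/stata/modules/reg/hetsc, clear
generate double lny_chk    = ln(y)
generate double log10y_chk = log10(y)

* The two transformations differ only by the constant factor 1/ln(10).
assert reldif(log10y_chk, lny_chk/ln(10)) < 1e-10

* Coefficient and t statistic for x1 in the natural-log model.
quietly regress lny_chk x1 x2 x3
scalar b_ln = _b[x1]
scalar t_ln = _b[x1]/_se[x1]

* The same quantities in the log10 model.
quietly regress log10y_chk x1 x2 x3
scalar b_log10 = _b[x1]
scalar t_log10 = _b[x1]/_se[x1]

display "ln coefficient / ln(10) = " b_ln/ln(10) "   log10 coefficient = " b_log10
display "t (ln) = " %6.3f t_ln "   t (log10) = " %6.3f t_log10
```

The same rescaling can be seen in the two regression tables above: .0054677 divided by ln(10) is approximately .0023746.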
While these results are not perfect, we will be content for now that this has substantially reduced the heteroscedasticity as compared to the original data.
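The comparison made in this module can be collected in one short loop. This is a sketch that assumes Stata 9 or later, where the test is run as estat hettest and leaves its results in r(chi2) and r(p). The variables sqy_chk and lny_chk are our own names for the square root and natural log of y.

```stata
* Assumes the hetsc data used in this module.
use http://www.ats.ucla.edu/stat/stata/modules/reg/hetsc, clear
generate double sqy_chk = sqrt(y)
generate double lny_chk = ln(y)

* Refit the model for each version of the outcome and report the test.
foreach v of varlist y sqy_chk lny_chk {
    quietly regress `v' x1 x2 x3
    quietly estat hettest
    display "`v'" _col(12) "chi2(1) = " %6.2f r(chi2) _col(32) "Prob > chi2 = " %6.4f r(p)
}
```

Reading down the chi-square column shows at a glance how much each transformation reduces the evidence of heteroscedasticity.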