Why do the sklearn and statsmodels implementations of OLS regression give different R-squared values?

2021-04-03

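The OLS models in sklearn and statsmodels produce different R-squared values when no intercept is fitted. With an intercept they agree. Consider the following code: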

import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

np.random.seed(42)

N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

sklernIntercept=sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklernNoIntercept=sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)

print(sklernIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklernNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)

print(sklearn.__version__, statsmodels.__version__)

Where does the difference come from?


All replies

yunnandlg

2021-4-6 13:55:20

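The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In other words, R² can be negative: a negative score occurs when the model's predictions fit the data worse than simply predicting the mean of the observed outputs.
score(X, y[, sample_weight]) is defined as (1 - u/v), where
u = ((y_true - y_pred) ** 2).sum() is the residual sum of squares and
v = ((y_true - y_true.mean()) ** 2).sum() is the total sum of squares.
The best score is 1.0; a lower score indicates a worse fit.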
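As a minimal sketch of the (1 - u/v) definition, the snippet below recomputes the score by hand with NumPy. The helper name `r2_score_by_hand` and the toy data are illustrative assumptions and not part of sklearn.

```python
import numpy as np

def r2_score_by_hand(y_true, y_pred):
    """Centered R^2: 1 - u/v, with u the residual and v the total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares around the mean
    return 1 - u / v

y = [1.0, 2.0, 3.0]
print(r2_score_by_hand(y, y))                # perfect prediction -> 1.0
print(r2_score_by_hand(y, [2.0, 2.0, 2.0]))  # predicting the mean -> 0.0
print(r2_score_by_hand(y, [3.0, 2.0, 1.0]))  # worse than the mean -> -3.0
```

The last call shows how a prediction that fits worse than the mean drives the score below zero.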

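The decomposition SST = SSE + SSR (centered total sum of squares = explained sum of squares + residual sum of squares) does not hold for a regression model without an intercept. As a result, the centered R^2 of a no-intercept regression can fall below 0 when computed as 1 - SSR/SST, or exceed 1 when computed as SSE/SST. In that case the uncentered R-squared can be used instead.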
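A short derivation, offered as a sketch, shows why the centered decomposition needs an intercept. The notation is an assumption made here: \hat{y}_i are the fitted values, e_i = y_i - \hat{y}_i the residuals, and \bar{y} the mean of y.

```latex
\begin{aligned}
\sum_i (y_i - \bar{y})^2
  &= \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2
     + 2 \sum_i (\hat{y}_i - \bar{y})\, e_i \\
  &= \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2
     - 2\,\bar{y} \sum_i e_i
  \qquad \text{since } \sum_i \hat{y}_i e_i = 0 \text{ for any OLS fit.} \\[6pt]
\sum_i y_i^2
  &= \sum_i \hat{y}_i^2 + \sum_i e_i^2
  \qquad \text{(uncentered, holds with or without an intercept).}
\end{aligned}
```

With an intercept the residuals sum to zero, so the cross term vanishes and the centered identity holds. Without one, \sum_i e_i is generally nonzero and only the uncentered identity is guaranteed.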
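【1】The score (R²) computed by sklearn follows the formula above strictly: it always uses the centered total sum of squares, whether or not an intercept is fitted.
【2】statsmodels switches definitions when there is no intercept. In that case, "R2 is computed without centering (uncentered) since the model does not contain a constant."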

rsquared – R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
rsquared_adj – Adjusted R-squared. This is defined here as 1 - (nobs-1)/df_resid * (1-rsquared) if a constant is included and 1 - nobs/df_resid * (1-rsquared) if no constant is included.
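To see both definitions side by side, here is a hedged NumPy-only sketch that fits the question's data through the origin and evaluates the two R² formulas. It uses `np.linalg.lstsq` as a stand-in for the libraries' own fitting routines. The variable names are illustrative. Under the definitions quoted above, `r2_centered` should correspond to what sklearn's `score` reports and `r2_uncentered` to statsmodels' `rsquared` for the no-intercept model.

```python
import numpy as np

# Same data-generating code as in the question.
np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Least-squares fit through the origin (no constant column).
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta

ssr = np.sum(resid ** 2)                    # sum of squared residuals
centered_tss = np.sum((Y - Y.mean()) ** 2)  # total sum of squares around the mean
uncentered_tss = np.sum(Y ** 2)             # total sum of squares around zero

r2_centered = 1 - ssr / centered_tss        # definition used when a constant is present
r2_uncentered = 1 - ssr / uncentered_tss    # definition used when the constant is omitted

print("centered R^2:  ", r2_centered)
print("uncentered R^2:", r2_uncentered)
```

Because the uncentered total sum of squares equals the centered one plus N times the squared mean of Y, the uncentered R² is never smaller than the centered one for the same fit. This is why the two libraries disagree only when the intercept is dropped.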

【3】Note: centered versus uncentered R² is a different distinction from R² versus adjusted R². Textbooks also remind us that R² carries little meaning under IV estimation (so it is usually not reported).

For reference, see https://www.stata-journal.com/sjpdf.html?articlenum=st0030

Regression through the origin is an important and useful tool in applied statistics, but it remains a subject of pedagogical neglect, controversy and confusion. Hopefully, this synthesis provides some clarity. However, in the light of the unresolved debate, perhaps the strongest conclusion to be drawn from this review is that the practice of statistics remains as much an art as it is a science, and the development of statistical judgment is therefore as important as computational skill.
