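Abstract (translated):
We consider evaluating and training a new policy for the evaluation data by using historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, while the goal of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Standard OPE and OPL assume that the historical and evaluation data share the same covariate distribution, but a covariate shift often exists, i.e., the covariate distribution of the historical data differs from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. We then propose doubly robust and efficient estimators for OPE and OPL under a covariate shift, using a nonparametric estimator of the density ratio between the historical and evaluation data distributions. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.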
---
English title:
Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
---
Authors:
Masahiro Kato, Masatoshi Uehara, Shota Yasui
---
Latest submission year:
2020
---
Categories:
Primary category: Statistics
Secondary category: Machine Learning
Category description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding
--
Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Economics
Secondary category: Econometrics
Category description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
---
English abstract:
We consider evaluating and training a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although the standard OPE and OPL assume the same distribution of covariate between the historical and evaluation data, a covariate shift often exists, i.e., the distribution of the covariate of the historical data is different from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using a nonparametric estimator of the density ratio between the historical and evaluation data distributions. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.
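
The abstract describes a doubly robust estimator that combines a density ratio between the evaluation and historical covariate distributions with a reward model. The following is a minimal sketch of a generic estimator of that kind, not the authors' implementation: the function name `dr_ope_covariate_shift`, its argument layout, and the synthetic Gaussian data are all assumptions made for illustration, and the density ratio and reward model are supplied directly here rather than estimated nonparametrically as in the paper.

```python
import numpy as np


def dr_ope_covariate_shift(r, a, w, pi_b, pi_e_hist, f_hist, pi_e_eval, f_eval):
    """Doubly robust value estimate under covariate shift (illustrative sketch).

    r, a      : rewards and integer actions in the historical data, shape (n,)
    w         : density ratio q(x)/p(x) at the historical covariates, shape (n,)
    pi_b      : behavior-policy probability of each logged action, shape (n,)
    pi_e_hist : evaluation-policy probabilities at historical covariates, (n, K)
    f_hist    : reward-model predictions at historical covariates, (n, K)
    pi_e_eval : evaluation-policy probabilities at evaluation covariates, (m, K)
    f_eval    : reward-model predictions at evaluation covariates, (m, K)
    """
    idx = np.arange(len(a))
    # Importance-weighted residual term, averaged over the historical sample.
    correction = w * pi_e_hist[idx, a] / pi_b * (r - f_hist[idx, a])
    # Model-based (direct) term, averaged over the evaluation sample.
    direct = np.sum(pi_e_eval * f_eval, axis=1)
    return correction.mean() + direct.mean()


# Synthetic check; every distribution below is an assumption for illustration.
rng = np.random.default_rng(0)
n, m = 200_000, 200_000
x = rng.normal(0.0, 1.0, n)              # historical covariates, p = N(0, 1)
z = rng.normal(0.5, 1.0, m)              # evaluation covariates, q = N(0.5, 1)
a = rng.integers(0, 2, n)                # uniform behavior policy, 2 actions
r = a * x + rng.normal(0.0, 0.1, n)      # reward with conditional mean a * x
w = np.exp(0.5 * x - 0.125)              # exact q(x)/p(x) for these Gaussians
pi_b = np.full(n, 0.5)
pi_e_hist = np.tile([0.0, 1.0], (n, 1))  # evaluation policy: always action 1
pi_e_eval = np.tile([0.0, 1.0], (m, 1))

# Correct density ratio, deliberately wrong reward model (all zeros).
est_bad_model = dr_ope_covariate_shift(
    r, a, w, pi_b, pi_e_hist, np.zeros((n, 2)), pi_e_eval, np.zeros((m, 2)))

# Correct reward model, deliberately wrong density ratio (all ones).
f_hist = np.column_stack([np.zeros(n), x])
f_eval = np.column_stack([np.zeros(m), z])
est_bad_ratio = dr_ope_covariate_shift(
    r, a, np.ones(n), pi_b, pi_e_hist, f_hist, pi_e_eval, f_eval)

# The true value is E_q[x] = 0.5; both estimates should land near it.
print(est_bad_model, est_bad_ratio)
```

In this toy setup the estimate stays close to the true value when either the reward model or the density ratio is misspecified (with the behavior policy known in both cases), which is the sense of "doubly robust" illustrated here. The efficiency claims in the abstract concern the authors' estimator and are not demonstrated by this sketch.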
---
PDF link:
https://arxiv.org/pdf/2002.11642