Title:
Doubly Robust Policy Evaluation and Learning
---
Authors:
Miroslav Dudik, John Langford, and Lihong Li
---
Latest submission year:
2011
---
Classification:
Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on), including robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary category: Computer Science
Secondary category: Robotics
Category description: Roughly includes material in ACM Subject Class I.2.9.
--
Primary category: Statistics
Secondary category: Applications
Category description: Biology, Education, Epidemiology, Engineering, Environmental Sciences, Medical, Physical Sciences, Quality Control, Social Sciences.
--
Primary category: Statistics
Secondary category: Machine Learning
Category description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high-dimensional inference, etc.) with a statistical or theoretical grounding.
--
---
Abstract:
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
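For concreteness, the following is a minimal sketch of the doubly robust value estimator the abstract refers to: a reward-model ("direct method") prediction plus an importance-weighted correction based on the estimated past (logging) policy, so the estimate stays accurate if either model is good. It assumes a deterministic target policy; all names (`dr_value_estimate`, `reward_model`, `target_policy`, the propensity estimates) are illustrative placeholders, not code from the paper.

```python
# A hedged sketch of a doubly robust (DR) off-policy value estimate
# on logged contextual-bandit data (contexts, actions, rewards).
import numpy as np

def dr_value_estimate(contexts, actions, rewards, propensities,
                      target_policy, reward_model):
    """Estimate the value of `target_policy` from logged bandit data.

    contexts      : observed contexts x_i
    actions       : actions a_i chosen by the past (logging) policy
    rewards       : observed rewards r_i for the chosen actions
    propensities  : estimates p_hat(a_i | x_i) of the past policy
    target_policy : function x -> action chosen by the new policy
    reward_model  : function (x, a) -> estimated reward r_hat(x, a)
    """
    values = []
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        pi_a = target_policy(x)
        # Direct-method term: modeled reward of the action the new policy would take.
        dm_term = reward_model(x, pi_a)
        # Correction term: importance-weighted residual, nonzero only when the
        # logged action matches the new policy's action.
        correction = (r - reward_model(x, a)) * (pi_a == a) / p
        values.append(dm_term + correction)
    # Accurate when either the reward model or the propensity estimates are good,
    # which is the "doubly robust" property claimed in the abstract.
    return float(np.mean(values))
```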
---
PDF link:
https://arxiv.org/pdf/1103.4601