A Practical Guide of Off-Policy Evaluation for Bandit Problems

2022-04-05

Title:
A Practical Guide of Off-Policy Evaluation for Bandit Problems
---
Authors:
Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui
---
Latest submission year:
2020
---
Categories:

Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Economics
Secondary category: Econometrics
Category description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Primary category: Statistics
Secondary category: Machine Learning
Category description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding.
--

---
Abstract:
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. Recently, applying OPE methods for bandit problems has garnered attention. For the theoretical guarantees of an estimator of the policy value, the OPE methods require various conditions on the target policy and policy used for generating the samples. However, existing studies did not carefully discuss the practical situation where such conditions hold, and the gap between them remains. This paper aims to show new results for bridging the gap. Based on the properties of the evaluation policy, we categorize OPE situations. Then, among practical applications, we mainly discuss the best policy selection. For the situation, we propose a meta-algorithm based on existing OPE estimators. We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
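
As a rough illustration of what "estimating the value of a target policy from samples obtained via different policies" can look like, the sketch below applies inverse probability weighting (IPW), one standard OPE estimator, to a context-free multi-armed bandit. It is not the meta-algorithm proposed in the paper. The function name `ipw_value`, the three-armed setup, and the reward probabilities are all assumptions chosen for illustration.

```python
import numpy as np


def ipw_value(actions, rewards, behavior_probs, target_probs):
    """Inverse probability weighting (IPW) estimate of a target policy's value.

    actions:        arm indices logged under the behavior policy, shape (n,)
    rewards:        rewards observed for those arms, shape (n,)
    behavior_probs: probability of each arm under the behavior policy, shape (k,)
    target_probs:   probability of each arm under the target policy, shape (k,)
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)

    # Overlap condition: every arm the target policy can choose must also
    # have positive probability under the behavior policy.
    if np.any((target_probs > 0) & (behavior_probs <= 0)):
        raise ValueError("behavior policy does not cover the target policy")

    # Reweight each logged reward by how much more (or less) likely the
    # target policy is to choose the logged arm than the behavior policy.
    weights = target_probs[actions] / behavior_probs[actions]
    return float(np.mean(weights * rewards))


# Hypothetical three-armed Bernoulli bandit (all numbers are illustrative).
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])
behavior = np.array([1 / 3, 1 / 3, 1 / 3])  # policy that generated the logs
target = np.array([0.1, 0.2, 0.7])          # policy we want to evaluate

n = 100_000
actions = rng.choice(3, size=n, p=behavior)
rewards = rng.binomial(1, true_means[actions])

estimate = ipw_value(actions, rewards, behavior, target)
true_value = float(target @ true_means)
print(f"IPW estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

The explicit overlap check is one example of a condition relating the target policy to the policy that generated the samples: if the logging policy never plays an arm that the target policy favors, the logged data carry no information about that arm's reward.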
---
PDF link:
https://arxiv.org/pdf/2010.12470
