Title:
Off-Policy Exploitability-Evaluation in Two-Player Zero-Sum Markov Games
---
Authors:
Kenshi Abe, Yusuke Kaneko
---
Latest submission year:
2020
---
Classification:
Primary category: Computer Science
Secondary category: Machine Learning (cs.LG)
Description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on), including robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Computer Science
Secondary category: Computer Science and Game Theory
Description: Covers all theoretical and applied aspects at the intersection of computer science and game theory, including work in mechanism design, learning in games (which may overlap with Learning), foundations of agent modeling in games (which may overlap with Multiagent systems), and coordination, specification, and formal methods for non-cooperative computational environments. The area also deals with applications of game theory to areas such as electronic commerce.
--
Primary category: Economics
Secondary category: Econometrics
Description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Primary category: Statistics
Secondary category: Machine Learning
Description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding.
---
Abstract:
Off-policy evaluation (OPE) is the problem of evaluating new policies using historical data obtained from a different policy. In the recent OPE context, most studies have focused on single-player cases rather than multi-player cases. In this study, we propose OPE estimators constructed from the doubly robust and double reinforcement learning estimators in two-player zero-sum Markov games. The proposed estimators estimate exploitability, which is often used as a metric for determining how close a policy profile (i.e., a tuple of policies) is to a Nash equilibrium in two-player zero-sum games. We prove exploitability estimation error bounds for the proposed estimators. We then propose methods to find the best candidate policy profile by selecting, from a given policy profile class, the policy profile that minimizes the estimated exploitability. We prove regret bounds for the policy profiles selected by our methods. Finally, we demonstrate the effectiveness and performance of the proposed estimators through experiments.
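For reference, the following is the standard definition of exploitability in a two-player zero-sum game; the notation here is generic and not taken from the paper itself. Writing v(\pi_1, \pi_2) for player 1's expected value under the policy profile (\pi_1, \pi_2):

% Exploitability: the total gain available to unilateral best responders.
% It is non-negative and equals zero exactly at a Nash equilibrium.
\mathrm{Expl}(\pi_1, \pi_2) = \max_{\pi_1'} v(\pi_1', \pi_2) \;-\; \min_{\pi_2'} v(\pi_1, \pi_2')

The selection rule described in the abstract then amounts to choosing, from the given policy profile class \Pi,

(\hat{\pi}_1, \hat{\pi}_2) = \operatorname*{arg\,min}_{(\pi_1, \pi_2) \in \Pi} \widehat{\mathrm{Expl}}(\pi_1, \pi_2),

where \widehat{\mathrm{Expl}} denotes the exploitability estimate produced by the proposed OPE estimators.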
---
PDF link:
https://arxiv.org/pdf/2007.02141