Posted: 2022-03-20
---
Title:
Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy
---
Authors:
Masahiro Kato and Yusuke Kaneko
---
Latest submission year:
2020
---
Classification:

Primary category: Computer Science
Secondary category: Machine Learning
Description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on), including robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Economics
Secondary category: Econometrics
Description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Primary category: Statistics
Secondary category: Machine Learning
Description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding.
--

---
Abstract:
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy. However, because the contextual bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples. In the data-generating process, we do not assume the convergence of the policy, but the policy uses the same conditional probability of choosing an action during a certain period. Then, we derive an asymptotically normal estimator of the value of an evaluation policy. As another advantage of our method, the batch-based approach simultaneously solves the deficient support problem. Using benchmark and real-world datasets, we experimentally confirm the effectiveness of the proposed method.
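
To make the abstract concrete, here is a minimal sketch of the inverse probability weighting (IPW) form that OPE estimators for adaptively collected bandit data typically take. The notation below (contexts X_t, actions A_t, rewards Y_t, evaluation policy pi^e, behavior policy pi_{b,t}, filtration F_{t-1}) is assumed for illustration only; the paper's actual estimator may differ in its details.

% Sketch of a batch-wise IPW estimator for OPE (notation assumed, not taken from the paper).
\[
  \hat{\theta}_T \;=\; \frac{1}{T} \sum_{t=1}^{T} z_t,
  \qquad
  z_t \;=\; \frac{\pi^{e}(A_t \mid X_t)}{\pi_{b,t}(A_t \mid X_t)}\, Y_t,
\]

where \(\pi_{b,t}\) is the behavior policy in force at round \(t\) and, under a batch update, stays constant within each batch. Because \(\pi_{b,t}\) is determined by observations up to round \(t-1\), \(\mathbb{E}[z_t \mid \mathcal{F}_{t-1}] = \theta\) (the true value of \(\pi^{e}\)) whenever \(\pi_{b,t}(a \mid x) > 0\) for every action with \(\pi^{e}(a \mid x) > 0\). Hence \(\{z_t - \theta\}_{t=1}^{T}\) forms a martingale difference sequence, and a martingale central limit theorem can deliver the asymptotic normality that the usual i.i.d. argument cannot.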
---
PDF link:
https://arxiv.org/pdf/2010.13554