2022-04-14
Translated abstract:
This paper studies the problem of off-policy evaluation (OPE) from dependent samples generated by a bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from the behavior policies generated by the bandit algorithm. Because the bandit algorithm updates its policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing OPE methods do not take this issue into account and are based on the assumption that the samples are i.i.d. In this study, we address the problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with a two-step estimation. The proposed method yields an estimator with asymptotic normality without restricting the class of behavior policies. In an experiment, the proposed estimator performs better than existing methods, which assume that the behavior policy converges to a time-invariant policy.
---
English title:
Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales
---
Author:
Masahiro Kato
---
Latest submission year:
2020
---
Classification information:

Primary category: Statistics
Secondary category: Machine Learning
Category description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding.
--
Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on), including robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Economics
Secondary category: Econometrics
Category description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Primary category: Statistics
Secondary category: Methodology
Category description: Design, Surveys, Model Selection, Multiple Testing, Multivariate Methods, Signal and Image Processing, Time Series, Smoothing, Spatial Statistics, Survival Analysis, Nonparametric and Semiparametric Methods.
--

---
English abstract:
  This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via the bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing methods for OPE do not take this issue into account and are based on the assumption that samples are i.i.d. In this study, we address this problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with a two-step estimation. This technique produces an estimator with asymptotic normality without restricting a class of behavior policies. In an experiment, the proposed estimator performs better than existing methods, which assume that the behavior policy converges to a time-invariant policy.
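For readers who want to experiment with the setting the abstract describes, below is a minimal sketch in Python of a plain importance-weighted (IPW) off-policy value estimate with a normal-approximation confidence interval. The helper ipw_ope_with_ci and its arguments are hypothetical, and the plug-in standard error ignores the sample dependence that the paper handles through a standardized martingale difference sequence; it is only a crude baseline, not the paper's estimator.

```python
# Minimal sketch (hypothetical helper, NOT the paper's estimator):
# plain IPW off-policy value estimate with a normal-approximation CI,
# assuming the logged probabilities of the pulled arms are recorded.
import numpy as np

def ipw_ope_with_ci(rewards, pi_e_probs, pi_b_probs, z=1.96):
    """rewards[t]    : reward observed at round t
    pi_b_probs[t] : probability the (possibly adaptive) behavior policy
                    assigned to the arm it actually pulled at round t
    pi_e_probs[t] : probability the evaluation policy assigns to that arm
    z             : normal quantile (1.96 gives a 95% interval)"""
    rewards = np.asarray(rewards, dtype=float)
    w = np.asarray(pi_e_probs, dtype=float) / np.asarray(pi_b_probs, dtype=float)
    scores = w * rewards                                  # per-round IPW scores
    point = scores.mean()                                 # point estimate of the policy value
    se = scores.std(ddof=1) / np.sqrt(len(scores))        # naive i.i.d.-style standard error
    return point, (point - z * se, point + z * se)

# Toy usage: logs from a behavior policy that drifts over time (adaptive),
# evaluated against a fixed policy whose true value is 0.9*0.7 + 0.1*0.3 = 0.66.
rng = np.random.default_rng(0)
T = 5000
p_good = np.clip(0.5 + 0.4 * np.arange(T) / T, 0.05, 0.95)   # behavior prob. of the better arm
pulled_good = rng.random(T) < p_good
rewards = np.where(pulled_good, rng.binomial(1, 0.7, T), rng.binomial(1, 0.3, T))
pi_e = np.where(pulled_good, 0.9, 0.1)                        # evaluation prob. of the pulled arm
pi_b = np.where(pulled_good, p_good, 1 - p_good)              # behavior prob. of the pulled arm
print(ipw_ope_with_ci(rewards, pi_e, pi_b))
```

Because the behavior policy in this toy log changes over rounds, the per-round scores are not i.i.d.; the paper's standardized-martingale construction is what restores a valid asymptotic-normality argument in that case, whereas the naive standard error above simply pretends the samples are i.i.d.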
---
PDF link:
https://arxiv.org/pdf/2006.06982