多臂强盗问题的纯探索

768

收藏 2022-04-06

摘要翻译：
我们考虑了随机多武器强盗问题的框架，并研究了预报员执行在线武器探测的可能性和局限性。对这些预测者的评估是以其简单的遗憾来衡量的，遗憾概念反映了这样一个事实，即勘探只受可用轮次的限制（不一定事先知道），而不是考虑累积遗憾和需要同时进行开采的情况。我们认为，这一业绩标准适用于用资源而不是报酬来表示拉一只胳膊的成本的情况。我们讨论了简单遗憾和累积遗憾之间的联系。在有限数量的武器的情况下，一个主要结果是关于预报员的简单遗憾的一般下界，即它的累积遗憾：后者越小，前者越大。记住这个结果，然后我们展示了一些预报员的简单遗憾的上限。论文最后研究了连续武装的土匪问题；我们证明了对于一族概率分布，简单后悔可以最小化当且仅当累积后悔可以最小化。基于这种等价性，我们可以证明，对于所有具有连续均值-收益函数的概率分布族，可分度量空间正是这些遗憾可以最小化的度量空间。
---
英文标题：
《Pure Exploration for Multi-Armed Bandit Problems》
---
作者：
S\'ebastien Bubeck (INRIA Futurs), R\'emi Munos (INRIA Futurs), Gilles
Stoltz (DMA, GREGH)
---
最新提交年份：
2010
---
分类信息：

一级分类：Mathematics 数学
二级分类：Statistics Theory 统计理论
分类描述：Applied, computational and theoretical statistics: e.g. statistical inference, regression, time series, multivariate analysis, data analysis, Markov chain Monte Carlo, design of experiments, case studies
应用统计、计算统计和理论统计：例如统计推断、回归、时间序列、多元分析、数据分析、马尔可夫链蒙特卡罗、实验设计、案例研究
--
一级分类：Computer Science 计算机科学
二级分类：Machine Learning 机器学习
分类描述：Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
关于机器学习研究的所有方面的论文（有监督的，无监督的，强化学习，强盗问题，等等），包括健壮性，解释性，公平性和方法论。对于机器学习方法的应用，CS.LG也是一个合适的主要类别。
--
一级分类：Statistics 统计学
二级分类：Statistics Theory 统计理论
分类描述：stat.TH is an alias for math.ST. Asymptotics, Bayesian Inference, Decision Theory, Estimation, Foundations, Inference, Testing.
Stat.Th是Math.St的别名。渐近，贝叶斯推论，决策理论，估计，基础，推论，检验。
--

---
英文摘要：
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
---
PDF链接：
https://arxiv.org/pdf/802.2655

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

扫码加我 拉你入群

分享

扫码加好友，拉您进群

扫码加我拉你入群