2022-04-02
Translated abstract:
We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which a stationary optimal policy is known to exist. We consider the Value Iteration algorithm and the sequence of policies $\pi_1,...,\pi_k$ it implicitly generates up to some iteration $k$. We provide performance bounds for non-stationary policies built from the last $m$ generated policies, which improve the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies reduces the usual asymptotic performance bound of Value Iteration with per-iteration errors bounded by $\epsilon$ from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the common situation where $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.
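To make the improvement concrete, here is a worked instance of the bounds above for illustrative values $\gamma = 0.99$ and $m = 10$ (these numbers are chosen for illustration, they are not from the paper):
$\frac{\gamma}{(1-\gamma)^2}\epsilon = 9900\epsilon$ (last stationary policy $\pi_k$),
$\frac{1-\gamma}{1-\gamma^m}\cdot\frac{\gamma}{(1-\gamma)^2}\epsilon = \frac{\gamma}{(1-\gamma)(1-\gamma^m)}\epsilon \approx 1035\epsilon$ (non-stationary policy over the last $m$ policies),
$\frac{\gamma}{1-\gamma}\epsilon = 99\epsilon$ (limit as $m \to \infty$).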
---
English title:
《On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes》
---
Author:
Bruno Scherrer (INRIA Lorraine - LORIA)
---
Latest submission year:
2012
---
Classification:

Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
English abstract:
We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1,...,\pi_k$ it implicitly generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies allows one to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $\epsilon$ at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.
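The following is a minimal, self-contained Python sketch of the setting described above (an illustration under assumptions, not the author's code): it runs approximate Value Iteration on a small random MDP with a noise of magnitude $\epsilon$ injected at every backup, keeps the greedy policies $\pi_1,...,\pi_k$, and compares the last stationary policy with the non-stationary policy that loops over the last $m$ of them. The random MDP, the values of $\epsilon$ and $m$, and all variable names are illustrative choices.

```python
# Sketch: approximate Value Iteration with per-iteration noise eps, then compare
# the last stationary greedy policy pi_k with the non-stationary policy that
# cycles through the last m greedy policies (pi_k, pi_{k-1}, ..., pi_{k-m+1}).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, eps = 30, 4, 0.98, 0.05

# Random transition kernel P[a, s, s'] and reward table r[s, a] (toy MDP).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def q_values(v):
    """Q(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) v(s')."""
    return r + gamma * np.einsum("ast,t->sa", P, v)

def evaluate_stationary(pi):
    """Exact value of a stationary policy: solve (I - gamma * P_pi) v = r_pi."""
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[pi, idx, :], r[idx, pi])

def evaluate_cyclic(pis, horizon=3000):
    """Value of the non-stationary policy pi(t) = pis[t % m], truncated at `horizon`
    (the truncation error is of order gamma**horizon, negligible here)."""
    idx, v = np.arange(n_states), np.zeros(n_states)
    for t in reversed(range(horizon)):
        pi = pis[t % len(pis)]
        v = r[idx, pi] + gamma * P[pi, idx, :] @ v
    return v

# Exact (noise-free) Value Iteration, only to get a reference optimal value v*.
v_star = np.zeros(n_states)
for _ in range(3000):
    v_star = q_values(v_star).max(axis=1)

# Approximate Value Iteration: each iterate is perturbed by a noise in [-eps, eps].
v, policies = np.zeros(n_states), []
for k in range(100):
    q = q_values(v)
    policies.append(q.argmax(axis=1))                  # greedy policy pi_{k+1}
    v = q.max(axis=1) + rng.uniform(-eps, eps, n_states)

m = 8
v_last = evaluate_stationary(policies[-1])             # stationary pi_k
v_loop = evaluate_cyclic(policies[-m:][::-1])          # loop over the last m policies

# The paper's bounds are worst-case guarantees; on this toy MDP the two losses
# may be close, but the non-stationary guarantee is the tighter one.
print("loss of stationary pi_k      :", np.max(v_star - v_last))
print("loss of m-step cyclic policy :", np.max(v_star - v_loop))
```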
---
PDF link:
https://arxiv.org/pdf/1203.5532