Title:
Infinite-Horizon Policy-Gradient Estimation
---
Authors:
Jonathan Baxter and Peter L. Bartlett
---
Year of latest submission:
2019
---
Classification:
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
Abstract:
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a {\em biased} estimate of the gradient of the {\em average reward} in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter $\beta\in [0,1)$ (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter $\beta$ is related to the {\em mixing time} of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
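The abstract does not spell out the update rule, so the following is only an illustrative sketch. It assumes the recursion usually associated with GPOMDP: an eligibility trace z discounted by $\beta$, and a running average of reward times trace. These two vectors are the "twice the number of policy parameters" of storage. The toy problem (one uninformative observation, two actions, a one-parameter logistic policy) and all function names are hypothetical and chosen only so that the exact gradient is known.

```python
import math
import random


def gpomdp(theta, beta, steps, env_step, sample_action, grad_log_policy, obs0, rng):
    """Sketch of a GPOMDP-style gradient estimate (assumed update rule).

    Only two vectors of len(theta) are stored: the trace z and the estimate delta.
    The underlying state is never inspected, only observations and rewards.
    """
    z = [0.0] * len(theta)      # eligibility trace, discounted by beta
    delta = [0.0] * len(theta)  # running average of reward * trace
    obs = obs0
    for t in range(steps):
        u = sample_action(theta, obs, rng)
        g = grad_log_policy(theta, obs, u)  # gradient of log-probability of u
        obs, reward = env_step(u, rng)
        for k in range(len(theta)):
            z[k] = beta * z[k] + g[k]
            delta[k] += (reward * z[k] - delta[k]) / (t + 1)
    return delta


# Hypothetical toy problem: action 1 pays reward 1, action 0 pays 0.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_action(theta, obs, rng):
    return 1 if rng.random() < sigmoid(theta[0]) else 0


def grad_log_policy(theta, obs, u):
    # d/dtheta of log Bernoulli(u; sigmoid(theta)) is u - sigmoid(theta)
    return [u - sigmoid(theta[0])]


def env_step(u, rng):
    # Single uninformative observation (always 0); reward equals the action.
    return 0, float(u)


if __name__ == "__main__":
    estimate = gpomdp([0.0], 0.5, 20000, env_step, sample_action,
                      grad_log_policy, 0, random.Random(0))
    # Average reward is sigmoid(theta), so the exact gradient at theta = 0 is 0.25;
    # the printed estimate is a noisy approximation of that value.
    print(estimate)
```

In this toy problem the reward depends only on the most recent action, so the choice of $\beta$ does not introduce bias. In a slowly mixing POMDP, a $\beta$ closer to 1 would be expected to reduce bias at the cost of higher variance, which is the trade-off the abstract refers to.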
---
PDF link:
https://arxiv.org/pdf/1106.0665