English Title:
Efficient Reinforcement Learning Using Recursive Least-Squares Methods
---
Authors:
H. He, D. Hu, X. Xu
---
Latest Submission Year:
2011
---
Classification:
Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
English Abstract:
The recursive least-squares (RLS) algorithm is one of the most well-known algorithms used in adaptive filtering, system identification, and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered to be optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, and two new reinforcement learning algorithms using linear value-function approximators are proposed and analyzed. The two algorithms are called RLS-TD(lambda) and Fast-AHC (Fast Adaptive Heuristic Critic), respectively. RLS-TD(lambda) can be viewed as the extension of RLS-TD(0) from lambda=0 to general lambda within the interval [0,1], so it is a multi-step temporal-difference (TD) learning algorithm using RLS methods. The convergence with probability one and the limit of convergence of RLS-TD(lambda) are proved for ergodic Markov chains. Compared to the existing LS-TD(lambda) algorithm, RLS-TD(lambda) has advantages in computation and is more suitable for online learning. The effectiveness of RLS-TD(lambda) is analyzed and verified by learning-prediction experiments on Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD(lambda) algorithm in the critic network of the adaptive heuristic critic method. Unlike the conventional AHC algorithm, Fast-AHC makes use of RLS methods to improve the learning-prediction efficiency of the critic. Learning-control experiments on the cart-pole balancing and acrobot swing-up problems are conducted to compare the data efficiency of Fast-AHC with that of the conventional AHC algorithm. The experimental results show that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD(lambda). Furthermore, the experiments demonstrate that different initial values of the variance matrix in RLS-TD(lambda) are required to obtain good performance in learning prediction as well as in learning control. The experimental results are analyzed based on existing theoretical work on the transient phase of forgetting-factor RLS methods.
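To make the recursion concrete, below is a minimal sketch of an RLS-TD(lambda) learner with a linear value-function approximator V(s) = phi(s)^T theta, written in the standard recursive least-squares (Sherman-Morrison) form for the LS-TD(lambda) equations. The class name, constructor parameters (n_features, lam, gamma, mu, delta), and the update interface are illustrative assumptions, not the authors' code; the equations are a sketch under those assumptions rather than a definitive implementation of the paper's algorithm.

```python
import numpy as np

class RLSTDLambda:
    """Sketch of RLS-TD(lambda) with linear features (hypothetical API)."""

    def __init__(self, n_features, lam=0.6, gamma=0.95, mu=1.0, delta=100.0):
        self.lam = lam        # trace-decay parameter lambda in [0, 1]
        self.gamma = gamma    # discount factor
        self.mu = mu          # forgetting factor (mu = 1 means no forgetting)
        self.theta = np.zeros(n_features)    # weight vector of V(s) = phi^T theta
        self.z = np.zeros(n_features)        # eligibility trace
        self.P = delta * np.eye(n_features)  # variance matrix, P_0 = delta * I

    def update(self, phi, reward, phi_next):
        """One recursive update from transition (s, r, s')."""
        # Eligibility trace: z_t = gamma * lambda * z_{t-1} + phi(s_t)
        self.z = self.gamma * self.lam * self.z + phi
        # TD feature difference: d = phi(s_t) - gamma * phi(s_{t+1})
        d = phi - self.gamma * phi_next
        Pz = self.P @ self.z
        denom = self.mu + d @ Pz
        K = Pz / denom                        # gain vector
        td_error = reward - d @ self.theta    # r_t - d^T theta_t
        self.theta = self.theta + K * td_error
        # Rank-one update of the variance matrix (Sherman-Morrison form)
        self.P = (self.P - np.outer(Pz, d @ self.P) / denom) / self.mu
        # Note: in episodic tasks, z would be reset to zero at episode start.

    def value(self, phi):
        return phi @ self.theta
```

Each call to update costs O(n^2) in the number of features and never solves a linear system explicitly, which is the kind of computational advantage over solving the LS-TD(lambda) normal equations that the abstract refers to. The scalar delta in P_0 = delta * I is the initial value of the variance matrix whose choice, as the abstract notes, affects transient performance in both learning prediction and learning control.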
---
PDF Link:
https://arxiv.org/pdf/1106.0707