Paper Title
Momentum-Based Policy Gradient with Second-Order Information
Paper Authors
Paper Abstract
Variance-reduced gradient estimators for policy gradient methods have been one of the main focuses of reinforcement learning research in recent years, as they accelerate the estimation process. We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with a time-varying learning rate. The SHARP algorithm is parameter-free and reaches an $ε$-approximate first-order stationary point with $O(ε^{-3})$ trajectories, while using a batch size of $O(1)$ at each iteration. Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the advantage of variance reduction. Moreover, the variance of the estimation error decays at the fast rate of $O(1/t^{2/3})$, where $t$ is the number of iterations. Our extensive experimental evaluations demonstrate the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
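To make the estimator concrete, below is a minimal sketch of a SHARP-style update: a momentum (STORM-like) estimator whose gradient-difference term is replaced by a stochastic Hessian-vector product evaluated at a point sampled on the segment between consecutive iterates, so no importance-sampling weights are needed. This is an illustration under stated assumptions, not the paper's exact algorithm: the toy quadratic surrogate objective, the `grad_sample`/`hvp_sample` helpers, and the step-size and momentum schedules are all hypothetical stand-ins.

```python
# Hypothetical sketch of a SHARP-style momentum estimator with a
# Hessian-vector-product (HVP) correction in place of importance sampling.
# The objective here is a toy quadratic surrogate for the expected return;
# grad_sample/hvp_sample are illustrative, not the paper's estimators.
import numpy as np

rng = np.random.default_rng(0)
dim = 5
A = np.diag(np.linspace(1.0, 3.0, dim))  # toy curvature matrix

def grad_sample(theta):
    """Stochastic gradient of the toy objective J(θ) = -½ θᵀAθ (noise mimics trajectory sampling)."""
    return -A @ theta + 0.1 * rng.standard_normal(dim)

def hvp_sample(theta_bar, v):
    """Stochastic Hessian-vector product at theta_bar applied to direction v."""
    return -A @ v + 0.1 * rng.standard_normal(dim)

theta = rng.standard_normal(dim)
d = grad_sample(theta)  # initial estimator; note the O(1) batch per iteration

for t in range(1, 201):
    eta = min(1.0, 1.0 / t ** (2.0 / 3.0))  # time-varying momentum weight (assumed schedule)
    gamma = 0.5 / t ** (1.0 / 3.0)          # time-varying learning rate (assumed schedule)
    # Normalized ascent step on the return estimate.
    theta_next = theta + gamma * d / (np.linalg.norm(d) + 1e-12)
    # Hessian-aided correction: approximate the gradient difference along the
    # step via an HVP at a uniformly sampled point on [theta, theta_next].
    u = rng.uniform()
    theta_bar = theta + u * (theta_next - theta)
    correction = hvp_sample(theta_bar, theta_next - theta)
    # STORM-style recursion: d_t = η g(θ_t) + (1-η)(d_{t-1} + HVP correction).
    d = eta * grad_sample(theta_next) + (1.0 - eta) * (d + correction)
    theta = theta_next

print("final gradient norm ≈", np.linalg.norm(A @ theta))
```

Because the correction term is an HVP at an intermediate point rather than a reweighted gradient at the previous parameters, the recursion avoids the potentially unbounded importance weights that distribution shift between consecutive policies would otherwise introduce.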