Paper Title
Critic Algorithms using Cooperative Networks
Paper Authors
Paper Abstract
An algorithm is proposed for policy evaluation in Markov Decision Processes which gives good empirical results with respect to convergence rates. The algorithm tracks the Projected Bellman Error and is implemented as a true gradient-based algorithm; in this respect it differs from the TD($\lambda$) class of algorithms. Because it tracks the Projected Bellman Error rather than the full Bellman Error, it also differs from the class of residual algorithms. Furthermore, its convergence is empirically much faster than that of the GTD2 class of algorithms, which likewise aim to track the Projected Bellman Error. We implemented the proposed algorithm in the DQN and DDPG frameworks and found that it achieves comparable results in both of these experiments.
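For context, a minimal sketch of the Projected Bellman Error objective referred to above, written for the standard linear function approximation setting; the symbols $\theta$, $\Phi$, $D$, $\Pi$, and $T^{\pi}$ are conventional notation assumed here, not taken from the paper:

% Mean Squared Projected Bellman Error (MSPBE) for a linear value estimate
% V_\theta = \Phi\theta; \Pi is the projection onto the span of the feature
% matrix \Phi under the state-distribution weighting D, and T^\pi is the
% Bellman operator of the evaluated policy \pi.
\[
  \mathrm{MSPBE}(\theta)
    = \bigl\lVert V_\theta - \Pi T^{\pi} V_\theta \bigr\rVert_{D}^{2},
  \qquad
  \Pi = \Phi \,(\Phi^{\top} D \,\Phi)^{-1} \Phi^{\top} D .
\]

Residual algorithms, by contrast, minimize the unprojected Bellman Error $\lVert V_\theta - T^{\pi} V_\theta \rVert_{D}^{2}$, which is the distinction the abstract draws.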