Paper Title

Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking

Paper Authors

Eshwar S. R., Shishir Kolathaya, Gugan Thoppe

Paper Abstract

Evolution Strategy (ES) is a powerful black-box optimization technique based on the idea of natural evolution. In each of its iterations, a key step entails ranking candidate solutions based on some fitness score. For an ES method in Reinforcement Learning (RL), this ranking step requires evaluating multiple policies. This is presently done via on-policy approaches: each policy's score is estimated by interacting several times with the environment using that policy. This leads to a lot of wasteful interactions since, once the ranking is done, only the data associated with the top-ranked policies is used for subsequent learning. To improve sample efficiency, we propose a novel off-policy alternative for ranking, based on a local approximation for the fitness function. We demonstrate our idea in the context of a state-of-the-art ES method called the Augmented Random Search (ARS). Simulations in MuJoCo tasks show that, compared to the original ARS, our off-policy variant has similar running times for reaching reward thresholds but needs only around 70% as much data. It also outperforms the recent Trust Region ES. We believe our ideas should be extendable to other ES methods as well.
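
For context, the on-policy ranking step that the abstract identifies as wasteful looks roughly like the following in a basic ARS-style update (cf. Mania et al., 2018). This is a minimal NumPy sketch, not the paper's off-policy variant; `env_rollout` is a hypothetical helper standing in for a full MuJoCo episode, and the hyperparameter values are placeholders.

```python
import numpy as np

def ars_step(theta, env_rollout, n_dirs=8, top_b=4, nu=0.05, alpha=0.02):
    """One basic ARS-style update: perturb the policy parameters in random
    directions, rank the directions by their on-policy rollout returns, and
    step along the top-ranked directions.

    `env_rollout(params)` is a hypothetical helper that runs one episode with
    the (linear) policy given by `params` and returns its total reward.
    """
    deltas = [np.random.randn(*theta.shape) for _ in range(n_dirs)]

    # On-policy ranking: every candidate direction is scored with fresh
    # environment interactions in both the +delta and -delta perturbations.
    scored = []
    for delta in deltas:
        r_plus = env_rollout(theta + nu * delta)
        r_minus = env_rollout(theta - nu * delta)
        scored.append((max(r_plus, r_minus), r_plus, r_minus, delta))

    # Keep only the top-b directions; the rollouts gathered for the rest are
    # discarded, which is the wasted interaction the abstract refers to.
    scored.sort(key=lambda s: s[0], reverse=True)
    top = scored[:top_b]

    # Normalize the step by the standard deviation of the retained returns.
    sigma_r = np.std([r for _, rp, rm, _ in top for r in (rp, rm)]) + 1e-8
    step = sum((rp - rm) * d for _, rp, rm, d in top) / (top_b * sigma_r)
    return theta + alpha * step
```

The paper's contribution is to replace the fresh rollouts in the ranking loop above with an off-policy estimate based on a local approximation of the fitness function; the details of that estimator are in the paper itself and are not reproduced here.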
