Paper Title

POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning

Paper Authors

Schubert, Frederik, Benjamins, Carolin, Döhler, Sebastian, Rosenhahn, Bodo, Lindauer, Marius

Paper Abstract

The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization) - a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.
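Below is a minimal, illustrative sketch (in PyTorch, not the authors' released code) of the regularization idea described in the abstract: policy snapshots collected during pretraining form an ensemble prior, and a KL penalty added to the URL algorithm's own loss pulls the current policy toward that prior. Names such as PolicyNet, polter_regularizer, and reg_coef, the discrete action space, and the KL direction are assumptions made for illustration only.

```python
# Hedged sketch of ensemble regularization for URL pretraining (illustrative only).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Toy categorical policy over a discrete action space (assumed for illustration)."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(self.net(obs), dim=-1)  # log action probabilities

def polter_regularizer(policy: PolicyNet,
                       snapshots: list[PolicyNet],
                       obs: torch.Tensor) -> torch.Tensor:
    """KL(current policy || ensemble prior), where the prior is the uniform
    mixture of action distributions from earlier policy snapshots."""
    with torch.no_grad():
        probs = torch.stack([s(obs).exp() for s in snapshots]).mean(dim=0)
        prior_log_probs = probs.clamp_min(1e-8).log()
    cur_log_probs = policy(obs)
    # F.kl_div(input=log q, target=log p, log_target=True) computes KL(p || q);
    # here p is the current policy and q the ensemble prior.
    return F.kl_div(prior_log_probs, cur_log_probs,
                    log_target=True, reduction="batchmean")

# Schematic use inside a pretraining step:
policy = PolicyNet(obs_dim=8, n_actions=4)
snapshots = [copy.deepcopy(policy)]          # grown periodically during pretraining
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
reg_coef = 0.1                               # assumed regularization strength
obs_batch = torch.randn(32, 8)
url_loss = torch.tensor(0.0)                 # placeholder for the URL algorithm's loss
optimizer.zero_grad()
loss = url_loss + reg_coef * polter_regularizer(policy, snapshots, obs_batch)
loss.backward()
optimizer.step()
```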
