Paper Title

Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning

Paper Authors

Yang Yue, Bingyi Kang, Zhongwen Xu, Gao Huang, Shuicheng Yan

Paper Abstract

Deep reinforcement learning (RL) algorithms suffer severe performance degradation when interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for boosting sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which differs from how the model is used in RL, namely for value-based planning. Accordingly, the representation learned by these visual methods may be good for recognition but not optimal for estimating state values and solving decision problems. To address this issue, we propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the "imagined state") based on the current one and a sequence of actions. Instead of aligning this imagined state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values. A distance between the two distributions is then computed and minimized, forcing the imagined state to produce an action-value prediction similar to that of the real state. We develop two implementations of the above idea for discrete and continuous action spaces, respectively. We conduct experiments on the Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness in improving sample efficiency. Our methods achieve new state-of-the-art performance among search-free RL algorithms.
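
The sketch below illustrates one plausible form of the value-consistency objective described above for the discrete-action case. The module names (`encoder`, `transition_model`, `q_head`) and the choice of KL divergence as the distance are assumptions made for illustration; they are not taken from the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def vcr_loss(encoder, transition_model, q_head, obs_t, actions, obs_tk):
    """Sketch of a value-consistency loss: make the imagined future state
    yield the same action-value distribution as the real future state.

    encoder, transition_model, q_head: hypothetical nn.Module objects.
    obs_t: current observation batch; actions: sequence of action batches;
    obs_tk: real observation returned by the environment after the actions.
    """
    # Encode the current observation and roll the latent state forward
    # through the action sequence to obtain the "imagined" future state.
    z = encoder(obs_t)
    for a in actions:
        z = transition_model(z, a)

    # Encode the real future observation and turn its Q-values into a
    # target distribution over actions (no gradient through the target).
    with torch.no_grad():
        z_real = encoder(obs_tk)
        q_real = F.softmax(q_head(z_real), dim=-1)

    # Action-value distribution predicted from the imagined state.
    log_q_imagined = F.log_softmax(q_head(z), dim=-1)

    # KL divergence as the distance between the two distributions;
    # minimizing it aligns the imagined state with the real one in
    # value space rather than in pixel or feature space.
    return F.kl_div(log_q_imagined, q_real, reduction="batchmean")
```

In practice this term would be added to the usual RL loss (e.g., a temporal-difference loss), so the representation is shaped both by value learning and by the consistency constraint.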
