Self-supervised Sequential Information Bottleneck for Robust Exploration in Deep Reinforcement Learning

Paper Title

Self-supervised Sequential Information Bottleneck for Robust Exploration in Deep Reinforcement Learning

Authors

Bang You, Jingming Xie, Youping Chen, Jan Peters, Oleg Arenz

Abstract

Effective exploration is critical for reinforcement learning agents in environments with sparse rewards or high-dimensional state-action spaces. Recent works based on state-visitation counts, curiosity, and entropy maximization generate intrinsic reward signals that motivate the agent to visit novel states for exploration. However, the agent can get distracted by perturbations to sensor inputs that contain novel but task-irrelevant information, e.g. due to sensor noise or a changing background. In this work, we introduce the sequential information bottleneck objective for learning compressed and temporally coherent representations by modelling and compressing sequential predictive information in time-series observations. For efficient exploration in noisy environments, we further construct intrinsic rewards that capture task-relevant state novelty based on the learned representations. We derive a variational upper bound of our sequential information bottleneck objective for practical optimization and provide an information-theoretic interpretation of the derived upper bound. Our experiments on a set of challenging image-based simulated control tasks show that our method achieves better sample efficiency and greater robustness to both white noise and natural video backgrounds than state-of-the-art methods based on curiosity, entropy maximization, and information gain.
