Paper Title

Control-Aware Representations for Model-based Reinforcement Learning

Authors

Brandon Cui, Yinlam Chow, Mohammad Ghavamzadeh

Abstract

A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning. We derive three implementations of CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in a significant improvement. In online CARL, we interleave representation learning and control, and demonstrate a further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.
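The most concrete mechanism named in the abstract is value-guided CARL's TD-error-weighted loss. Below is a minimal PyTorch sketch of that weighting idea, not the authors' implementation: the function name `value_guided_loss`, the temperature `beta`, and the softmax normalization over TD-error magnitudes are all illustrative assumptions, and the paper's exact weighting scheme may differ.

```python
import torch


def value_guided_loss(carl_loss_per_sample: torch.Tensor,
                      td_error: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch: weight per-transition representation losses by the
    TD-error of the current policy, so transitions the critic finds surprising
    contribute more to representation learning.

    carl_loss_per_sample: shape (B,), per-transition CARL loss terms
    td_error:             shape (B,), TD-errors of the current policy
    beta:                 temperature; larger values focus the weights more
                          sharply on high-TD-error transitions
    """
    # Detach so the weights do not backpropagate into the critic; the softmax
    # normalization over the batch is an assumption of this sketch.
    weights = torch.softmax(beta * td_error.detach().abs(), dim=0)
    return (weights * carl_loss_per_sample).sum()


# Usage example with dummy tensors:
if __name__ == "__main__":
    losses = torch.rand(32, requires_grad=True)   # stand-in per-sample CARL losses
    td_err = torch.randn(32)                      # stand-in TD-errors
    print(value_guided_loss(losses, td_err).item())
```

With `beta = 0` the weights become uniform and the objective reduces to an ordinary (averaged) CARL loss, which is one way to read value-guided CARL as a generalization of the unweighted variant.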
