Paper Title
ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data
Paper Authors
Paper Abstract
We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, ARMOR is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the learned policy of ARMOR never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned, and the baseline policy is supported by the data. Such a robust policy improvement property makes ARMOR especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit learning can bring.
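A minimal sketch of the relative-pessimism objective suggested by the abstract, under assumed notation not stated there: $\mathcal{M}_\alpha$ is a set of models consistent with the offline data (controlled by hyperparameter $\alpha$), $\pi_{\text{ref}}$ is the baseline policy, and $J_M(\pi)$ is the expected return of $\pi$ in model $M$:

$$\hat{\pi} \in \operatorname*{arg\,max}_{\pi \in \Pi} \; \min_{M \in \mathcal{M}_\alpha} \Big( J_M(\pi) - J_M(\pi_{\text{ref}}) \Big)$$

Under this reading, the no-degradation claim follows because $\pi_{\text{ref}}$ itself is always a feasible choice with worst-case relative performance zero, so the optimizer $\hat{\pi}$ attains a worst-case relative performance of at least zero against every model in $\mathcal{M}_\alpha$.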