Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes

Paper Title

Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes

Authors

Min Zhang, Hongyao Tang, Jianye Hao, Yan Zheng

Abstract

How a policy is represented and optimized is a fundamental problem at the heart of intelligent decision-making systems. The root challenge is the large scale and high complexity of the policy space, which exacerbates the difficulty of policy learning, especially in real-world scenarios. As a desirable surrogate policy space, policy representation in a low-dimensional latent space has recently shown its potential for improving both the evaluation and the optimization of policies. The key question in these studies is by what criterion the policy space should be abstracted to obtain the desired compression and generalization. However, both the theory of policy abstraction and the methodology of policy representation learning are little studied in the literature. In this work, we make a first effort to fill this gap. First, we propose a unified policy abstraction theory containing three types of policy abstraction, each associated with policy features at a different level. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) between policies, for more convenient use in learning policy representations. Further, we propose a policy representation learning approach based on deep metric learning. In the empirical study, we investigate the efficacy of the proposed policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. Our experiments cover both policy optimization and policy evaluation problems, including trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES), and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while the influence-irrelevance policy abstraction can be a generally preferred choice.
