Paper Title

Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Paper Authors

Miao Lu, Yifei Min, Zhaoran Wang, Zhuoran Yang

Paper Abstract

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
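The abstract describes a general recipe: estimate policy values from the confounded dataset via minimax (conditional-moment) formulations from proximal causal inference, collect the candidate estimates that are consistent with the data into confidence regions, and select the policy that maximizes its pessimistic (worst-case) value. The sketch below is a hypothetical illustration of that generic template, not the authors' implementation of \texttt{P3O}; the names `pessimistic_value`, `bridge_candidates`, `test_functions`, `threshold`, and the `h.residual` / `h.value` interfaces are all assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical sketch (assumed names and interfaces, not the paper's code) of
# the "pessimism via confidence regions + minimax estimation" template
# described in the abstract.

def pessimistic_value(policy, dataset, bridge_candidates, test_functions, threshold):
    """Keep every candidate bridge function whose worst-case (over test
    functions) violation of the empirical moment equations is below
    `threshold` -- a minimax-style confidence region -- and return the
    smallest policy value any member of the region implies."""
    values_in_region = []
    for h in bridge_candidates:
        # Worst-case empirical violation of the moment equations identifying h.
        violation = max(
            abs(np.mean([f(traj) * h.residual(traj, policy) for traj in dataset]))
            for f in test_functions
        )
        if violation <= threshold:
            # Value of `policy` implied by this bridge function.
            values_in_region.append(
                np.mean([h.value(traj, policy) for traj in dataset])
            )
    return min(values_in_region) if values_in_region else -np.inf

def select_policy_pessimistically(policies, dataset, bridge_candidates,
                                  test_functions, threshold):
    """Return the candidate policy maximizing its pessimistic value estimate."""
    return max(
        policies,
        key=lambda pi: pessimistic_value(pi, dataset, bridge_candidates,
                                         test_functions, threshold),
    )
```

The pessimistic adjustment is what makes a partial coverage assumption sufficient: a policy only gets credit for value that every bridge function consistent with the offline data supports.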
