Paper Title
Strengthening Deterministic Policies for POMDPs
Paper Authors
Paper Abstract
The synthesis problem for partially observable Markov decision processes (POMDPs) is to compute a policy that satisfies a given specification. Such policies have to take the full execution history of a POMDP into account, rendering the problem undecidable in general. A common approach is to use a limited amount of memory and randomize over potential choices. Yet, this problem is still NP-hard and often computationally intractable in practice. A restricted problem is to use neither history nor randomization, yielding policies that are called stationary and deterministic. Previous approaches to compute such policies employ mixed-integer linear programming (MILP). We provide a novel MILP encoding that supports sophisticated specifications in the form of temporal logic constraints and can handle an arbitrary number of such specifications. Yet, randomization and memory are often mandatory to achieve satisfactory policies. First, we extend our encoding to deliver a restricted class of randomized policies. Second, based on the results of the original MILP, we apply a preprocessing step to the POMDP to incorporate memory-based decisions. The advantages of our approach over state-of-the-art POMDP solvers lie (1) in the flexibility to strengthen simple deterministic policies without losing computational tractability and (2) in the ability to enforce the provable satisfaction of arbitrarily many specifications. The latter point allows us to account for trade-offs between the performance and safety aspects of typical POMDP examples. We show the effectiveness of our method on a broad range of benchmarks.
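To make the core idea concrete, below is a minimal sketch of a big-M MILP that synthesizes a stationary deterministic observation-based policy for a tiny POMDP. This is not the paper's exact encoding: the toy model, all identifiers, and the use of the PuLP solver interface are illustrative assumptions, and the objective is expected discounted reward rather than a temporal-logic constraint, since the discounted Bellman equations have a unique fixed point under every policy, which keeps the big-M linking sound.

import pulp

# Toy POMDP (illustrative): states, actions, observation function, dynamics.
states = [0, 1, 2]
actions = ["a", "b"]
obs = {0: "x", 1: "x", 2: "y"}          # states 0 and 1 are indistinguishable
observations = sorted(set(obs.values()))
# P[(s, a)] = list of (next_state, probability)
P = {
    (0, "a"): [(1, 1.0)],
    (0, "b"): [(2, 1.0)],
    (1, "a"): [(2, 1.0)],
    (1, "b"): [(0, 1.0)],
    (2, "a"): [(2, 1.0)],
    (2, "b"): [(2, 1.0)],
}
R = {(s, a): (1.0 if s == 2 else 0.0) for s in states for a in actions}
gamma = 0.9
v_max = max(R.values()) / (1 - gamma)   # upper bound on any value; big-M constant

model = pulp.LpProblem("det_policy_synthesis", pulp.LpMaximize)

# sigma[z][a] = 1 iff the policy picks action a when observing z
sigma = pulp.LpVariable.dicts("sigma", (observations, actions), cat="Binary")
# v[s] = value of state s under the chosen policy
v = pulp.LpVariable.dicts("v", states, lowBound=0, upBound=v_max)

# Determinism: exactly one action per observation
for z in observations:
    model += pulp.lpSum(sigma[z][a] for a in actions) == 1

# Big-M linking: if sigma[obs(s)][a] = 1, force v[s] to equal the Bellman
# value of action a in s; otherwise both inequalities are slack.
for s in states:
    for a in actions:
        bellman = R[(s, a)] + gamma * pulp.lpSum(pr * v[t] for t, pr in P[(s, a)])
        slack = v_max * (1 - sigma[obs[s]][a])
        model += v[s] <= bellman + slack
        model += v[s] >= bellman - slack

model += 1 * v[0]                       # objective: value of the initial state
model.solve(pulp.PULP_CBC_CMD(msg=False))

for z in observations:
    chosen = [a for a in actions if pulp.value(sigma[z][a]) > 0.5][0]
    print(f"observation {z}: take {chosen}")
print("value of initial state:", pulp.value(v[0]))

Note that simply relaxing the binary variables sigma to the interval [0, 1] does not yield randomized policies in this sketch, because the Bellman linking would become bilinear in sigma and v; this is one reason the extension described in the abstract targets only a restricted class of randomized policies rather than arbitrary randomization.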