Title
Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications
Authors
Abstract
Although reinforcement learning has become very popular in recent years, the number of successful applications to different kinds of operations research problems is rather scarce. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov Decision Process, but in contrast does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First, we provide deep theoretical insights into the widely applied standard discounted reinforcement learning framework, which give rise to the understanding of why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profit. Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. In contrast to former methods, it assesses the average reward per step separately and thus prevents the incautious combination of different types of state values. Thereby, the Laurent series expansion of the discounted state values forms the foundation of this development and also provides the connection between the two approaches. Finally, we prove the viability of our algorithm on a challenging problem set, which includes a well-studied M/M/1 admission control queuing system. In contrast to standard discounted reinforcement learning, our algorithm infers the optimal policy on all tested problems. The insight is that, in the operations research domain, machine learning techniques have to be adapted and advanced so that these methods can be applied successfully in our settings.
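The Laurent series expansion mentioned in the abstract is the standard decomposition of the discounted value function (in the form given, e.g., by Puterman for unichain MDPs); the notation below is illustrative and may differ from the paper's:

```latex
% Laurent series expansion of the discounted state value V_\gamma(s)
% for discount factor \gamma \to 1 in a unichain MDP:
%   \rho  : average reward per step (gain), independent of s
%   h(s)  : bias (relative value) of state s
%   e_\gamma(s) : error term vanishing as \gamma \to 1
V_\gamma(s) \;=\; \frac{\rho}{1-\gamma} \;+\; h(s) \;+\; e_\gamma(s),
\qquad \lim_{\gamma \to 1} e_\gamma(s) = 0.
```

This separation is what motivates assessing the average reward $\rho$ independently of the bias values $h(s)$: for $\gamma$ close to 1, the term $\rho/(1-\gamma)$ dominates $V_\gamma(s)$ and can mask the state-dependent differences that distinguish policies.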