Paper Title
Assessing Policy, Loss and Planning Combinations in Reinforcement Learning using a New Modular Architecture
Paper Authors
Paper Abstract
The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that can create challenges for research. In this work, we propose a new modular software architecture suited for these types of agents, together with a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include planning algorithms, policies, and loss functions. We illustrate the use of this architecture by combining several of these building blocks to implement and test agents optimized for three different test environments: Cartpole, Minigrid, and Tictactoe. One particular planning algorithm, made available in our implementation and not previously used in reinforcement learning, which we call averaged minimax, achieved good results in all three tested environments. Experiments performed with this architecture show that the best combination of planning algorithm, policy, and loss function is heavily problem dependent. This result provides evidence that the proposed architecture, which is modular and reusable, is useful for reinforcement learning researchers who want to study new environments and techniques.
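The abstract describes agents assembled from interchangeable planning, policy, and loss building blocks. The sketch below shows one minimal way such a composition could look in Python; the interfaces and names (ModularAgent, AveragedMinimax, Greedy, value_mse) are illustrative assumptions for this note, not the paper's actual API.

```python
# Hypothetical sketch of the modular composition idea: an agent is assembled
# from interchangeable building blocks (planning algorithm, policy, loss).
# Names and interfaces are illustrative, not taken from the paper's codebase.
from dataclasses import dataclass
from typing import Callable, List, Protocol


class PlanningAlgorithm(Protocol):
    def plan(self, state, model) -> List[float]:
        """Return refined value estimates, one per available action."""
        ...


class Policy(Protocol):
    def select_action(self, action_values: List[float]) -> int:
        """Turn planner-produced action values into a chosen action index."""
        ...


@dataclass
class ModularAgent:
    planner: PlanningAlgorithm
    policy: Policy
    loss_fn: Callable  # e.g. maps (model outputs, planner targets) to a scalar loss

    def act(self, state, model) -> int:
        # Planning refines the model's raw estimates; the policy converts
        # those estimates into an action, independently of which planner is used.
        action_values = self.planner.plan(state, model)
        return self.policy.select_action(action_values)


# Swapping blocks yields different agents for different environments, e.g.:
# agent = ModularAgent(planner=AveragedMinimax(), policy=Greedy(), loss_fn=value_mse)
```

Under this kind of interface, comparing combinations of planner, policy, and loss across environments amounts to constructing agents with different block instances, which is the experimental setup the abstract describes.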