Paper Title
Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent
Paper Authors
Paper Abstract
We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty, seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, and assumed to be known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation, and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well as, or better than, this algorithm in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
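To make the core idea of the abstract concrete, the following is a minimal sketch in JAX of jointly optimizing environment and policy parameters by differentiating a Monte-Carlo estimate of the return and taking projected gradient ascent steps. The toy damped point-mass dynamics, quadratic reward, linear policy, feasible parameter box, horizon and step size are all illustrative assumptions, not the authors' DEPS implementation.

```python
# Hypothetical sketch: projected stochastic gradient ascent over
# environment and policy parameters (not the authors' exact setup).
import jax
import jax.numpy as jnp

HORIZON = 50          # assumed finite time horizon
LEARNING_RATE = 1e-2  # assumed step size
NUM_ROLLOUTS = 64     # Monte-Carlo samples per gradient estimate

def transition(state, action, env_params, noise):
    # Differentiable toy dynamics: damped point mass with learnable
    # stiffness/damping (environment parameters) and additive noise.
    pos, vel = state
    stiffness, damping = env_params
    acc = action - stiffness * pos - damping * vel
    return jnp.array([pos + 0.05 * vel, vel + 0.05 * acc + 0.01 * noise])

def reward(state, action):
    # Penalize distance from the origin and control effort.
    return -(state @ state + 0.1 * action ** 2)

def policy(policy_params, state):
    # Linear feedback policy; the paper assumes a general parametrized policy.
    return policy_params @ state

def rollout_return(params, key):
    # One Monte-Carlo sample of the return, differentiable with respect
    # to both environment and policy parameters.
    env_params, policy_params = params
    noises = jax.random.normal(key, (HORIZON,))
    def step(state, noise):
        action = policy(policy_params, state)
        next_state = transition(state, action, env_params, noise)
        return next_state, reward(state, action)
    _, rewards = jax.lax.scan(step, jnp.array([1.0, 0.0]), noises)
    return rewards.sum()

def expected_return(params, key):
    keys = jax.random.split(key, NUM_ROLLOUTS)
    return jax.vmap(rollout_return, in_axes=(None, 0))(params, keys).mean()

@jax.jit
def update(params, key):
    # Gradient of the Monte-Carlo return estimate via automatic
    # differentiation, followed by a projected gradient ascent step
    # (projection = clipping environment parameters onto an assumed box).
    value, grads = jax.value_and_grad(expected_return)(params, key)
    env_params, policy_params = jax.tree_util.tree_map(
        lambda p, g: p + LEARNING_RATE * g, params, grads)
    env_params = jnp.clip(env_params, 0.1, 10.0)
    return (env_params, policy_params), value

params = (jnp.array([1.0, 0.5]), jnp.zeros(2))  # (environment, policy) parameters
key = jax.random.PRNGKey(0)
for _ in range(200):
    key, subkey = jax.random.split(key)
    params, ret = update(params, subkey)
print("final environment parameters:", params[0], "estimated return:", float(ret))
```

This sketch only illustrates the structure of the method described in the abstract: sampling trajectories, differentiating through the known transition and reward functions, and projecting the environment parameters back onto a feasible set after each ascent step.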