Paper Title
Reinforcement Learning for Robust Missile Autopilot Design
Paper Authors
Paper Abstract
Designing a missile's autopilot controller has long been a complex task, given the extensive flight envelope and the nonlinear flight dynamics. A solution that can excel both in nominal performance and in robustness to uncertainties is still to be found. While Control Theory often resorts to parameter-scheduling procedures, Reinforcement Learning has achieved interesting results in ever more complex tasks, ranging from videogames to robotic tasks with continuous action domains. However, it still lacks clear insights into how to find adequate reward functions and exploration strategies. To the best of our knowledge, this work is a pioneer in proposing Reinforcement Learning as a framework for flight control. It aims at training a model-free agent that can control the longitudinal flight of a missile, achieving optimal performance and robustness to uncertainties. To that end, under TRPO's methodology, the collected experience is augmented according to HER, stored in a replay buffer and sampled according to its significance. Not only does this work enhance the concept of prioritized experience replay into BPER, but it also reformulates HER, activating both only when the training progress converges to suboptimal policies, in what is proposed as the SER methodology. In addition, the Reward Engineering process is carefully detailed. The results show that it is possible both to achieve optimal performance and to improve the agent's robustness to uncertainties (with little loss of nominal performance) by further training it in non-nominal environments, thereby validating the proposed approach and encouraging future research in this field.
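The replay mechanics named in the abstract (HER-style goal relabelling, significance-based prioritized sampling, and their selective activation once training stalls) can be illustrated with a minimal sketch. This is not the paper's implementation: the class and method names (SelectiveReplayBuffer, maybe_activate, store_episode) and the relabelled-reward definition are hypothetical assumptions made only to show how the pieces could fit together.

```python
# Minimal, illustrative sketch of the abstract's replay ideas (not the authors' code):
# transitions are stored in a buffer, optionally relabelled with achieved goals
# (HER-style), and sampled in proportion to a significance score (prioritized replay).
# Both mechanisms are switched on only after training progress plateaus, mirroring
# the selective activation described as SER. All names and formulas are hypothetical.

import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Transition:
    state: tuple
    action: float
    reward: float
    next_state: tuple
    goal: tuple          # commanded value the agent is asked to track
    significance: float  # e.g. magnitude of the TD error or of the tracking error


class SelectiveReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)
        self.selective_mode = False  # HER + prioritization are off until activated

    def maybe_activate(self, recent_returns: list, tol: float = 1e-3) -> None:
        """Enable relabelling and prioritized sampling once the episode return
        has effectively stopped improving (a suboptimal plateau)."""
        if len(recent_returns) >= 10:
            window = recent_returns[-10:]
            if max(window) - min(window) < tol:
                self.selective_mode = True

    def store_episode(self, episode: list) -> None:
        for t in episode:
            self.buffer.append(t)
        if self.selective_mode and episode:
            # HER-style relabelling: treat the achieved final state as the goal,
            # so otherwise failed episodes still yield informative experience.
            achieved = episode[-1].next_state
            for t in episode:
                relabelled_reward = -abs(t.next_state[0] - achieved[0])  # assumed shape
                self.buffer.append(Transition(t.state, t.action, relabelled_reward,
                                              t.next_state, achieved, t.significance))

    def sample(self, batch_size: int) -> list:
        population = list(self.buffer)
        k = min(batch_size, len(population))
        if not self.selective_mode:
            return random.sample(population, k)
        # Prioritized sampling: higher-significance transitions are drawn more often.
        weights = [t.significance + 1e-6 for t in population]
        return random.choices(population, weights=weights, k=k)
```

In this sketch the buffer behaves like plain uniform experience replay at first; only when the monitored returns flatten out does it start relabelling goals and biasing the sampling toward significant transitions, which is one plausible reading of activating HER and BPER "only when the training progress converges to suboptimal policies."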