Paper Title


Off-Policy Risk-Sensitive Reinforcement Learning Based Constrained Robust Optimal Control

Authors

Cong Li, Qingchen Liu, Zhehua Zhou, Martin Buss, Fangzhou Liu

Abstract


This paper proposes an off-policy risk-sensitive reinforcement-learning-based control framework for stabilizing a continuous-time nonlinear system subject to additive disturbances, input saturation, and state constraints. By introducing pseudo controls and risk-sensitive input and state penalty terms, the constrained robust stabilization problem for the original system is converted into an equivalent optimal control problem for an auxiliary system. For the transformed optimal control problem, adaptive dynamic programming (ADP), implemented with a single-critic structure, is adopted to approximate the value function of the Hamilton-Jacobi-Bellman (HJB) equation, yielding an approximate optimal control policy that satisfies both input and state constraints under disturbances. By replaying recorded experience data in the off-policy weight update law of the critic neural network, weight convergence is guaranteed. Moreover, to collect experience data that provides the sufficient excitation required for weight convergence, online and offline algorithms are developed as principled ways to record informative experience data. An equivalence proof demonstrates that the optimal control policy of the auxiliary system robustly stabilizes the original system without violating the input and state constraints. Proofs of system stability and weight convergence are provided, and simulation results demonstrate the validity of the proposed control framework.
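To make the single-critic idea concrete, below is a minimal sketch (not the paper's actual algorithm; feature choice, gain, and data shapes are assumptions) of an ADP critic that approximates the value function as V(x) ≈ wᵀφ(x) and updates w by gradient descent on the HJB residual, replaying a stack of recorded experience samples in the style of concurrent learning so that convergence does not rely on excitation along the current trajectory alone:

```python
import numpy as np

def phi(x):
    # Illustrative polynomial critic features for a 2-D state.
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def dphi(x):
    # Jacobian of phi with respect to x (3 features x 2 states).
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2 * x[1]]])

def hjb_residual(w, x, xdot, cost):
    # Bellman/HJB residual: dV/dt + running cost, with
    # dV/dt = w^T dphi(x) xdot along the measured trajectory.
    return w @ (dphi(x) @ xdot) + cost

def critic_update(w, sample, memory, lr=0.01):
    # One gradient step on the squared HJB residual of the current
    # sample plus all replayed (x, xdot, cost) tuples in memory.
    x, xdot, cost = sample
    g = hjb_residual(w, x, xdot, cost) * (dphi(x) @ xdot)
    for xm, xdm, cm in memory:
        g += hjb_residual(w, xm, xdm, cm) * (dphi(xm) @ xdm)
    return w - lr * g
```

The replayed memory stack plays the role of the paper's recorded informative experience data: as long as the stored regressors span the feature space, the summed update drives the weight error to zero even when the live trajectory is not persistently exciting.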
