Paper Title
Structured Policy Iteration for Linear Quadratic Regulator
Paper Authors
Paper Abstract
Linear quadratic regulator (LQR) is one of the most popular frameworks for tackling continuous Markov decision process tasks. With its fundamental theory and tractable optimal policy, LQR has been revisited and analyzed in recent years in reinforcement learning scenarios such as the model-free and model-based settings. In this paper, we introduce \textit{Structured Policy Iteration} (S-PI) for LQR, a method capable of deriving a structured linear policy. Such a structured policy, with (block) sparsity or low-rank structure, can have significant advantages over the standard LQR policy: it is more interpretable, memory-efficient, and well suited to distributed settings. To derive such a policy, we first formulate a regularized LQR problem for the case where the model is known. Our Structured Policy Iteration (S-PI) algorithm then solves this regularized LQR efficiently by alternating between a policy evaluation step and a policy improvement step. We further extend the S-PI algorithm to the model-free setting, where a smoothing procedure is adopted to estimate the gradient. In both the known-model and model-free settings, we prove convergence under a proper choice of parameters. Finally, experiments demonstrate the advantage of S-PI in balancing LQR performance against the level of structure as the weight parameter is varied.
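To make the alternating structure concrete, the following is a minimal sketch, not the authors' implementation, of what one known-model S-PI iteration could look like for a discrete-time LQR with an l1 (sparsity-inducing) penalty: policy evaluation via discrete Lyapunov equations, then a proximal-gradient policy improvement step (soft-thresholding). The function name spi_step, the symbols Sigma0, lam, eta, and the use of the standard LQR policy-gradient formula are illustrative assumptions.

```python
# A minimal sketch (illustrative, not the paper's code) of one known-model
# S-PI iteration for discrete-time LQR with an l1 penalty on the gain K.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def spi_step(K, A, B, Q, R, Sigma0, lam, eta):
    """One policy-evaluation + proximal policy-improvement step (assumed form)."""
    Acl = A - B @ K                     # closed-loop dynamics: x_{t+1} = (A - B K) x_t
    # Policy evaluation: cost-to-go P_K and state covariance Sigma_K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    Sigma = solve_discrete_lyapunov(Acl, Sigma0)
    # Gradient of the (unregularized) LQR cost with respect to K
    grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    # Policy improvement: gradient step, then soft-thresholding
    # (proximal operator of eta * lam * ||K||_1) to promote sparsity
    G = K - eta * grad
    return np.sign(G) * np.maximum(np.abs(G) - eta * lam, 0.0)
```

Iterating such a step with a suitably small step size eta would, per the abstract, trade off LQR performance against the level of structure through the weight lam; other structured penalties (e.g., block sparsity or nuclear norm) would swap in a different proximal operator.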