Naive Exploration is Optimal for Online LQR

Paper Title

Naive Exploration is Optimal for Online LQR

Authors

Max Simchowitz, Dylan J. Foster

Abstract

We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. Notably, our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured due to the apparent strong convexity of the problem. Our upper bound is attained by a simple variant of $\textit{certainty equivalent control}$, where the learner selects control inputs according to the optimal controller for their estimate of the system while injecting exploratory random noise. While this approach was shown to achieve $\sqrt{T}$-regret by Mania et al. (2019), we show that if the learner continually refines their estimates of the system matrices, the method attains optimal dimension dependence as well. Central to our upper and lower bounds is a new approach for controlling perturbations of Riccati equations called the $\textit{self-bounding ODE method}$, which we use to derive suboptimality bounds for the certainty equivalent controller synthesized from estimated system dynamics. This in turn enables regret upper bounds which hold for $\textit{any stabilizable instance}$ and scale with natural control-theoretic quantities.
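
The certainty equivalent scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the doubling update schedule, the $t^{-1/4}$ noise scale, the unit-variance process noise, the Riccati fixed-point iteration, and all function names (`dare_gain`, `least_squares_estimate`, `run_ce_control`) are assumptions made here for illustration only.

```python
import numpy as np

def dare_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati recursion; return (P, K) with u = -K x."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

def least_squares_estimate(X, U, X_next):
    """Fit x_{t+1} ~ A x_t + B u_t by ordinary least squares."""
    Z = np.hstack([X, U])                       # rows are [x_t, u_t]
    theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    dx = X.shape[1]
    return theta[:dx].T, theta[dx:].T           # (A_hat, B_hat)

def run_ce_control(A, B, Q, R, K0, T=3200, seed=0):
    """Certainty equivalent control with injected exploration noise (sketch).

    K0 is assumed to stabilize the true system; T should be at least 100
    so that at least one re-estimation happens.
    """
    rng = np.random.default_rng(seed)
    dx, du = B.shape
    x, K = np.zeros(dx), K0
    A_hat = B_hat = None
    X, U, Xn = [], [], []
    next_update = 100                           # assumed doubling schedule
    for t in range(1, T + 1):
        sigma = t ** -0.25                      # assumed exploration scale
        u = -K @ x + sigma * rng.standard_normal(du)
        x_next = A @ x + B @ u + rng.standard_normal(dx)
        X.append(x); U.append(u); Xn.append(x_next)
        x = x_next
        if t == next_update:
            # Re-estimate the system, then act as if the estimate were exact.
            A_hat, B_hat = least_squares_estimate(
                np.array(X), np.array(U), np.array(Xn))
            _, K = dare_gain(A_hat, B_hat, Q, R)
            next_update *= 2
    return A_hat, B_hat, K
```

The learner here refits the system matrices at the end of each epoch and recomputes the controller from the estimate alone, which is what makes the approach "certainty equivalent"; the decaying noise term is the only explicit exploration.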
