Blending Controllers via Multi-Objective Bandits

Paper Title

Blending Controllers via Multi-Objective Bandits

Authors

Gohari, Parham; Djeumou, Franck; Vinod, Abraham P.; Topcu, Ufuk

Abstract

Safety and performance are often two competing objectives in sequential decision-making problems. Existing performant controllers, such as controllers derived from reinforcement learning algorithms, often fall short of safety guarantees. In contrast, controllers that guarantee safety, such as those derived from classical control theory, require restrictive assumptions and are often conservative in performance. Our goal is to blend a performant and a safe controller to generate a single controller that is safer than the performant controller and accumulates higher rewards than the safe controller. To this end, we propose a blending algorithm using the framework of contextual multi-armed multi-objective bandits. At each stage, the algorithm observes the environment's current context alongside an immediate reward and cost, the latter being the underlying safety measure. The algorithm then decides which controller to employ based on its observations. We demonstrate that the algorithm achieves sublinear Pareto regret, a performance measure that models coherence with an expert that always avoids picking the controller with both inferior safety and performance. We derive an upper bound on the loss in individual objectives, which imposes no additional computational complexity. We empirically demonstrate the algorithm's success in blending a safe and a performant controller in a safety-focused testbed, the Safety Gym environment. A statistical analysis of the blended controller's total reward and cost reflects two key takeaways: the blended controller shows a strict improvement in performance compared to the safe controller, and it is safer than the performant controller.
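
The selection loop described in the abstract (observe a context, a reward, and a cost, then choose a controller) can be illustrated with a small sketch. This is not the paper's algorithm: it is a generic Pareto-UCB-style rule written under several assumptions, namely that contexts are discrete indices, that each controller is scored on the vector (reward, -cost), and that a standard UCB1 exploration bonus is used. The names `BlendingBandit`, `dominates`, and `pareto_front` are hypothetical.

```python
import math
import random


def dominates(u, v):
    """True if u is at least as good as v in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(vectors):
    """Indices of the vectors that no other vector dominates."""
    return [
        i for i, u in enumerate(vectors)
        if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)
    ]


class BlendingBandit:
    """Hypothetical two-objective bandit that picks one controller per stage."""

    def __init__(self, n_controllers=2, n_contexts=1, seed=0):
        # Per-context pull counts and running means of (reward, -cost).
        self.counts = [[0] * n_controllers for _ in range(n_contexts)]
        self.means = [
            [[0.0, 0.0] for _ in range(n_controllers)] for _ in range(n_contexts)
        ]
        self.rng = random.Random(seed)
        self.t = 0

    def select(self, context):
        """Return the index of the controller to employ in this context."""
        self.t += 1
        counts = self.counts[context]
        # Try every controller once before comparing estimates.
        for k, n in enumerate(counts):
            if n == 0:
                return k
        # Optimistic estimate per objective (assumed UCB1-style bonus).
        ucb = []
        for k, n in enumerate(counts):
            bonus = math.sqrt(2.0 * math.log(self.t) / n)
            ucb.append([m + bonus for m in self.means[context][k]])
        # Never pick a controller that is worse in both objectives.
        return self.rng.choice(pareto_front(ucb))

    def update(self, context, k, reward, cost):
        """Fold one observed (reward, cost) pair into controller k's estimate."""
        self.counts[context][k] += 1
        n = self.counts[context][k]
        obs = (reward, -cost)  # negate cost so that larger is better in both
        self.means[context][k] = [
            m + (o - m) / n for m, o in zip(self.means[context][k], obs)
        ]
```

In use, each stage would call `select(context)`, run the chosen controller for that stage, and pass the observed reward and cost to `update`. Choosing uniformly from the Pareto front mirrors the idea of avoiding a controller that is inferior in both safety and performance, while leaving the trade-off among non-dominated controllers unresolved.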
