Paper Title
Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs
Paper Authors
Paper Abstract
Centaurs are half-human, half-AI decision-makers in which the AI's goal is to complement the human. To do so, the AI must be able to recognize the human's goals and constraints and have the means to help them. We present a novel formulation of the interaction between the human and the AI as a sequential game in which the agents are modelled using Bayesian best-response models. We show that in this setting the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP. In our simulated experiments, we consider an instantiation of our framework for humans who are subjectively optimistic about the AI's future behaviour. Our results show that, when equipped with a model of the human, the AI can infer the human's bounds and nudge them towards better decisions. We also discuss ways in which the machine can learn to improve upon its own limitations with the help of the human. We identify a novel trade-off for centaurs in partially observable tasks: for the AI's actions to be acceptable to the human, the machine must ensure that the two agents' beliefs are sufficiently aligned, but aligning beliefs can be costly. We present a preliminary theoretical analysis of this trade-off and its dependence on task structure.
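To make the Bayes-adaptive idea concrete, below is a minimal, illustrative Python sketch, not the paper's implementation. It assumes the human's bounded rationality can be summarized by a single hypothetical parameter (a softmax inverse temperature beta) and shows how the AI's posterior over that parameter is updated from observed human actions, mirroring how the unknown human model is folded into the hidden state of a Bayes-adaptive POMDP. All names, values, and the softmax human model are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid of candidate inverse temperatures for the human
# (higher beta = closer to fully rational action selection).
betas = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
belief = np.full(len(betas), 1.0 / len(betas))  # uniform prior over beta

# Assumed action values for a toy three-action decision the human faces.
q_values = np.array([1.0, 0.2, -0.5])

def softmax_policy(q, beta):
    """Bounded-rational (softmax) action distribution for a given beta."""
    z = beta * (q - q.max())  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Simulate a human whose true beta is 1.0; the AI observes each action
# and performs the Bayes update  P(beta | a) ∝ P(a | beta) P(beta).
true_beta = 1.0
for _ in range(20):
    action = rng.choice(len(q_values), p=softmax_policy(q_values, true_beta))
    likelihoods = np.array(
        [softmax_policy(q_values, b)[action] for b in betas]
    )
    belief = belief * likelihoods
    belief /= belief.sum()

print("posterior over beta:", dict(zip(betas.tolist(), belief.round(3))))
```

Running this, the posterior mass concentrates around the true beta, which is the inference step the abstract refers to when it says the AI "can infer the human's bounds"; in the full Bayes-adaptive POMDP this belief would additionally condition the AI's own action selection.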