Paper Title
Pavlovian Signalling with General Value Functions in Agent-Agent Temporal Decision Making
Paper Authors
Paper Abstract
In this paper, we contribute a multi-faceted study into Pavlovian signalling -- a process by which learned, temporally extended predictions made by one agent inform decision-making by another agent. Signalling is intimately connected to time and timing. In service of generating and receiving signals, humans and other animals are known to represent time, determine time since past events, predict the time until a future stimulus, and both recognize and generate patterns that unfold in time. We investigate how different temporal processes impact coordination and signalling between learning agents by introducing a partially observable decision-making domain we call the Frost Hollow. In this domain, a prediction learning agent and a reinforcement learning agent are coupled into a two-part decision-making system that works to acquire sparse reward while avoiding time-conditional hazards. We evaluate two domain variations: machine agents interacting in a seven-state linear walk, and human-machine interaction in a virtual-reality environment. Our results showcase the speed of learning for Pavlovian signalling, the impact that different temporal representations do (and do not) have on agent-agent coordination, and how temporal aliasing impacts agent-agent and human-agent interactions differently. As a main contribution, we establish Pavlovian signalling as a natural bridge between fixed signalling paradigms and fully adaptive communication learning between two agents. We further show how to computationally build this adaptive signalling process out of a fixed signalling process, characterized by fast continual prediction learning and minimal constraints on the nature of the agent receiving signals. Our results therefore suggest an actionable, constructivist path towards communication learning between reinforcement learning agents.
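The two-part coupling described in the abstract can be made concrete with a short sketch. Below is a minimal illustration, assuming a tabular TD(λ) learner for the general value function (GVF) and a fixed threshold as the tokenisation rule; the names (GVFPredictor, pavlovian_token), the parameter values, the random-walk dynamics, and the threshold mapping are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

# Hyperparameters (illustrative choices, not taken from the paper)
N_STATES = 7     # seven-state linear walk, as in the machine-agent domain
GAMMA = 0.9      # GVF discount, setting the prediction's temporal horizon
ALPHA = 0.1      # TD step size
LAMBDA = 0.9     # eligibility-trace decay
THRESHOLD = 0.5  # hypothetical cutoff mapping predictions to tokens


def features(state):
    """One-hot (tabular) feature vector for a walk state."""
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x


class GVFPredictor:
    """Linear GVF learned with TD(lambda): v(s) ~ E[sum_k gamma^k c_{t+k+1}]."""

    def __init__(self):
        self.w = np.zeros(N_STATES)  # value weights
        self.z = np.zeros(N_STATES)  # eligibility trace

    def start_episode(self):
        self.z[:] = 0.0

    def update(self, x, cumulant, x_next):
        delta = cumulant + GAMMA * (self.w @ x_next) - self.w @ x
        self.z = GAMMA * LAMBDA * self.z + x
        self.w += ALPHA * delta * self.z

    def predict(self, x):
        return float(self.w @ x)


def pavlovian_token(prediction):
    """Fixed signalling rule: emit a token when the hazard prediction is high."""
    return 1 if prediction > THRESHOLD else 0


# Tiny demonstration: a random walker, with a hazard cumulant of 1 on reaching
# the right end of the walk. Each step, the token would be appended to the
# control agent's observation stream; here we only learn and print predictions.
rng = np.random.default_rng(0)
predictor = GVFPredictor()
for episode in range(500):
    predictor.start_episode()
    state = 3  # start in the middle of the walk
    while True:
        next_state = state + rng.choice((-1, 1))
        terminal = next_state in (0, N_STATES - 1)
        cumulant = 1.0 if next_state == N_STATES - 1 else 0.0
        x_next = np.zeros(N_STATES) if terminal else features(next_state)
        predictor.update(features(state), cumulant, x_next)
        token = pavlovian_token(predictor.predict(features(state)))
        if terminal:
            break
        state = next_state

print("learned hazard predictions:", np.round(predictor.w, 2))
```

The sketch reflects the minimal-constraint property the abstract highlights: the prediction learner needs only a cumulant and a state stream, and the receiving agent treats the emitted token as an ordinary observation, so nothing about its learning algorithm needs to change for signalling to occur.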