Paper Title
Cooperative Online Learning in Stochastic and Adversarial MDPs
Paper Authors
Paper Abstract
We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: \emph{fresh} -- where each agent's trajectory is sampled i.i.d., and \emph{non-fresh} -- where the realization is shared by all agents (but each agent's trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) either with non-fresh randomness or in adversarial MDPs.
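To make the distinction between the two randomness models concrete, the following is a minimal illustrative sketch (not taken from the paper) of an episodic tabular MDP with $m$ agents. All names, sizes, and the Bernoulli cost model are assumptions made purely for illustration: under fresh randomness each agent draws its own costs and next states i.i.d., while under non-fresh randomness a single realization of every cost and transition is fixed at the start of the episode, so agents visiting the same state-action pair at the same step observe identical outcomes.

```python
import numpy as np

# Illustrative sketch only: tabular episodic MDP with S states, A actions,
# horizon H, and m cooperating agents. Sizes and cost model are assumptions.
S, A, H, m = 4, 2, 5, 3
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # transition kernel: P[h, s, a] is a distribution over next states
C_mean = rng.uniform(size=(H, S, A))           # mean costs in [0, 1]

def run_episode(policies, fresh=True):
    """Simulate one episode for m agents, each following its own deterministic policy."""
    if not fresh:
        # Non-fresh randomness: fix the realization of every transition and cost
        # at the start of the episode; agents at the same (h, s, a) see the same outcome.
        next_state = np.array([[[rng.choice(S, p=P[h, s, a]) for a in range(A)]
                                for s in range(S)] for h in range(H)])
        cost = (rng.uniform(size=(H, S, A)) < C_mean).astype(float)  # Bernoulli cost realizations
    states = np.zeros(m, dtype=int)
    total_cost = np.zeros(m)
    for h in range(H):
        for i in range(m):
            a = policies[i][h, states[i]]
            if fresh:
                # Fresh randomness: each agent's cost and next state are sampled independently.
                total_cost[i] += float(rng.uniform() < C_mean[h, states[i], a])
                states[i] = rng.choice(S, p=P[h, states[i], a])
            else:
                total_cost[i] += cost[h, states[i], a]
                states[i] = next_state[h, states[i], a]
    return total_cost

policies = [rng.integers(A, size=(H, S)) for _ in range(m)]
print(run_episode(policies, fresh=True), run_episode(policies, fresh=False))
```

In the non-fresh branch, two agents that happen to reach the same state at the same step and pick the same action accumulate identical cost and transition to the same next state, which is exactly the coupling that distinguishes the non-fresh setting from the fresh one.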