Paper Title
Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration
Paper Authors
Paper Abstract
Expert Iteration (ExIt) is an effective framework for learning game-playing policies from self-play. ExIt involves training a policy to mimic the search behaviour of a tree search algorithm - such as Monte-Carlo tree search - and using the trained policy to guide it. The policy and the tree search can then iteratively improve each other, through experience gathered in self-play between instances of the guided tree search algorithm. This paper outlines three different approaches for manipulating the distribution of data collected from self-play, and the procedure that samples batches for learning updates from the collected data. Firstly, samples in batches are weighted based on the durations of the episodes in which they were originally experienced. Secondly, Prioritized Experience Replay is applied within the ExIt framework, to prioritise sampling experience from which we expect to obtain valuable training signals. Thirdly, a trained exploratory policy is used to diversify the trajectories experienced in self-play. This paper summarises the effects of these manipulations on training performance evaluated in fourteen different board games. We find major improvements in early training performance in some games, and minor improvements averaged over fourteen games.
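To make the first manipulation concrete, the sketch below shows one plausible way to weight the samples in a training batch by the duration of the episode each sample came from, so that samples drawn from long self-play games do not dominate the learning update. This is an illustration under our own assumptions (the function name, the exact 1/duration weighting, and the within-batch normalisation are choices made here), not the paper's implementation.

```python
import numpy as np

def duration_weighted_loss(per_sample_losses, episode_durations):
    """Combine per-sample losses in a batch, weighting each sample
    inversely to the length (number of moves) of the self-play episode
    it was drawn from. Hypothetical helper, not code from the paper.

    Both arguments are 1-D sequences of equal length.
    """
    losses = np.asarray(per_sample_losses, dtype=np.float64)
    durations = np.asarray(episode_durations, dtype=np.float64)
    weights = 1.0 / durations      # down-weight samples from long episodes
    weights /= weights.sum()       # normalise weights within the batch
    return float(np.sum(weights * losses))

# Example: two samples from a 20-move game and one from a 60-move game.
loss = duration_weighted_loss(
    per_sample_losses=[1.2, 0.8, 1.5],
    episode_durations=[20, 20, 60],
)
print(loss)
```

A similar reweighting could equally be folded into the sampling step itself (as in Prioritized Experience Replay, the paper's second manipulation) rather than into the loss; the batch-weighting form is used here only because it is the simplest to show in a few lines.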