Paper Title

Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO

Paper Authors

Holubar, Mario S., Wiering, Marco A.

Paper Abstract

In this paper, a novel racing environment for OpenAI Gym is introduced. This environment operates with continuous action- and state-spaces and requires agents to learn to control the acceleration and steering of a car while navigating a randomly generated racetrack. Different versions of two actor-critic learning algorithms are tested on this environment: Sampled Policy Gradient (SPG) and Proximal Policy Optimization (PPO). An extension of SPG is introduced that aims to improve learning performance by weighting action samples during the policy update step. The effect of using experience replay (ER) is also investigated. To this end, a modification to PPO is introduced that allows for training using old action samples by optimizing the actor in log space. Finally, a new technique for performing ER is tested that aims to improve learning speed without sacrificing performance by splitting the training into two parts, whereby networks are first trained using state transitions from the replay buffer, and then using only recent experiences. The results indicate that experience replay is not beneficial to PPO in continuous action spaces. The training of SPG seems to be more stable when actions are weighted. All versions of SPG outperform PPO when ER is used. The ER trick is effective at improving training speed on a computationally less intensive version of SPG.
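The abstract mentions an SPG extension that weights action samples during the policy update step. The sketch below is a minimal, hedged illustration of that general idea only; the PyTorch framing, the Gaussian perturbation sampling, the softmax-over-critic-values weighting, and all function and parameter names are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of a weighted
# Sampled Policy Gradient actor update in PyTorch. Assumptions: a
# deterministic actor(states) -> actions in [-1, 1], a critic(states,
# actions) -> Q-value estimates, Gaussian perturbation sampling, and
# softmax-over-Q weighting of the sampled actions.
import torch
import torch.nn.functional as F

def weighted_spg_actor_update(actor, critic, actor_opt, states,
                              n_samples=8, noise_std=0.1):
    """Sample actions around the actor's output, weight them by the
    critic's value estimates, and regress the actor toward the
    value-weighted combination of the samples."""
    batch, state_dim = states.shape
    with torch.no_grad():
        base = actor(states)                                # (B, A)
        act_dim = base.shape[-1]
        # Candidate actions: the actor's action plus Gaussian noise.
        noise = noise_std * torch.randn(n_samples, batch, act_dim)
        candidates = (base.unsqueeze(0) + noise).clamp(-1.0, 1.0)

        # Evaluate every candidate with the critic Q(s, a).
        s_rep = states.unsqueeze(0).expand(n_samples, batch, state_dim)
        q = critic(s_rep.reshape(-1, state_dim),
                   candidates.reshape(-1, act_dim)).reshape(n_samples, batch)

        # Weight samples by value; softmax weighting is an assumption here.
        weights = torch.softmax(q, dim=0).unsqueeze(-1)     # (S, B, 1)
        target = (weights * candidates).sum(dim=0)          # (B, A)

    # Move the actor toward the value-weighted action target.
    loss = F.mse_loss(actor(states), target)
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```

In this reading, a single call such as weighted_spg_actor_update(actor, critic, optimizer, states) performs one actor step on a batch of states (from recent rollouts or a replay buffer), while the critic would be trained separately with a standard temporal-difference loss. Regressing toward a value-weighted mixture rather than a single best sample is one plausible way such weighting could yield the more stable training the abstract reports for weighted SPG.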
