Paper Title
Parameter Critic: A Model-Free Variance Reduction Method Through Imperishable Samples
Paper Authors
Paper Abstract
We consider the problem of finding a policy that maximizes the expected reward accumulated over the trajectory of an agent interacting with an unknown environment. This framework, commonly known as Reinforcement Learning, suffers from the need for a large number of samples at each step of the learning process. To this end, we introduce the parameter critic, a formulation that allows samples to remain valid even when the parameters of the policy change. In particular, we propose using a function approximator to directly learn the relationship between the policy parameters and the expected cumulative reward. Through convergence analysis, we demonstrate that the parameter critic outperforms gradient-free parameter-space exploration techniques, as it is robust to noise. Empirically, we show that our method solves the cart-pole problem, corroborating our claim: the agent successfully learns an optimal policy while simultaneously learning the relationship between the parameters and the cumulative reward.
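The abstract's central idea, fitting a function approximator that maps policy parameters directly to expected return so that collected (parameters, return) samples never expire, can be illustrated with a short sketch. The following is a minimal illustrative example, not the paper's implementation: it assumes a small PyTorch regressor as the parameter critic and a toy quadratic stand-in for the environment rollout; all names, dimensions, and hyperparameters here are hypothetical.

```python
import torch
import torch.nn as nn

class ParameterCritic(nn.Module):
    """Illustrative surrogate: maps a policy parameter vector theta
    to a predicted cumulative reward (a scalar)."""
    def __init__(self, dim_theta, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_theta, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, theta):
        return self.net(theta).squeeze(-1)

dim_theta = 8                                        # hypothetical policy size
critic = ParameterCritic(dim_theta)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

theta = torch.zeros(dim_theta, requires_grad=True)   # current policy parameters
theta_opt = torch.optim.Adam([theta], lr=1e-2)

replay = []  # stored (theta, return) pairs stay valid after theta changes

def rollout_return(theta):
    """Toy stand-in for running the policy parameterized by theta in the
    environment and returning its cumulative reward."""
    return -float(torch.sum(theta ** 2))

for step in range(1000):
    # 1) Evaluate the current parameters once; the sample never expires.
    G = rollout_return(theta.detach())
    replay.append((theta.detach().clone(), torch.tensor(G)))

    # 2) Fit the critic by regression on all stored (theta, return) pairs.
    thetas = torch.stack([t for t, _ in replay])
    returns = torch.stack([g for _, g in replay])
    critic_opt.zero_grad()
    loss = nn.functional.mse_loss(critic(thetas), returns)
    loss.backward()
    critic_opt.step()

    # 3) Update theta by ascending the critic's estimate of the return.
    theta_opt.zero_grad()
    surrogate_loss = -critic(theta.unsqueeze(0))[0]
    surrogate_loss.backward()
    theta_opt.step()
```

The design choice this sketch highlights is the contrast with standard policy-gradient methods: because the critic is regressed on parameter vectors rather than on states and actions from a particular policy, every past rollout continues to contribute to the fit after the policy is updated.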