Paper Title

Statistically Efficient Off-Policy Policy Gradients

Paper Authors

Nathan Kallus, Masatoshi Uehara

Paper Abstract

Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value. In this paper, we consider the statistically efficient estimation of policy gradients from off-policy data, where the estimation is particularly non-trivial. We derive the asymptotic lower bound on the feasible mean-squared error in both Markov and non-Markov decision processes and show that existing estimators fail to achieve it in general settings. We propose a meta-algorithm that achieves the lower bound without any parametric assumptions and exhibits a unique 3-way double robustness property. We discuss how to estimate nuisances that the algorithm relies on. Finally, we establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
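
As background for the update rule the abstract describes, here is a minimal sketch of the plain trajectory-level importance-sampling policy-gradient estimator, the kind of baseline the paper argues is statistically inefficient. All names (`is_policy_gradient`, `grad_log_pi`, `pi_theta`, `pi_b`) are hypothetical; the paper's efficient meta-algorithm additionally uses estimated nuisances (e.g., Q-functions and density ratios), which are omitted here.

```python
import numpy as np

def is_policy_gradient(trajectories, grad_log_pi, pi_theta, pi_b):
    """Plain importance-sampling off-policy policy-gradient estimate (a sketch).

    trajectories: list of [(s, a, r), ...] collected under behavior policy pi_b.
    grad_log_pi(s, a): gradient of log pi_theta(a|s) w.r.t. theta (np.ndarray).
    pi_theta(s, a), pi_b(s, a): action probabilities under target / behavior policy.
    NOTE: this is NOT the paper's efficient estimator; it is the baseline form
    whose mean-squared error generally fails to attain the efficiency bound.
    """
    grad = None
    for traj in trajectories:
        rho = 1.0    # cumulative importance weight prod_t pi_theta / pi_b
        score = 0.0  # sum of score functions sum_t grad log pi_theta(a_t|s_t)
        ret = 0.0    # trajectory return sum_t r_t
        for (s, a, r) in traj:
            rho *= pi_theta(s, a) / pi_b(s, a)
            score = score + grad_log_pi(s, a)
            ret += r
        term = rho * ret * score
        grad = term if grad is None else grad + term
    return grad / len(trajectories)
```

A gradient-ascent step then takes the form `theta = theta + alpha * is_policy_gradient(...)`; the paper's stationary-point guarantees concern iterating such steps with its efficient estimate in place of this one.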
