Paper Title
Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization
Paper Authors
Paper Abstract
We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.
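For concreteness, the following is a minimal sketch, in standard notation, of the log-linear policy class and the natural policy gradient step described in the abstract; the feature function phi, the parameters theta_t, the step size eta_t, and the fitting distribution are assumptions of this sketch, and the paper's exact statements and error terms may differ.

% Log-linear policy: softmax over linear features phi(s,a).
\pi_{\theta}(a \mid s)
  = \frac{\exp\!\big(\theta^{\top}\phi(s,a)\big)}
         {\sum_{a'} \exp\!\big(\theta^{\top}\phi(s,a')\big)}

% NPG step (policy mirror descent view): w_t is the best linear fit of the
% current Q-function on the features (the source of the bias and estimation
% errors discussed above); the step size grows geometrically, e.g.
% \eta_t \propto \gamma^{-t}, with \gamma the discount factor.
w_t \in \arg\min_{w}\;
  \mathbb{E}_{(s,a)}\!\Big[\big(w^{\top}\phi(s,a) - Q^{\pi_{\theta_t}}(s,a)\big)^{2}\Big],
\qquad
\theta_{t+1} = \theta_t + \eta_t\, w_t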