Paper Title
Langevin Dynamics for Adaptive Inverse Reinforcement Learning of Stochastic Gradient Algorithms
Paper Authors
Paper Abstract
Inverse reinforcement learning (IRL) aims to estimate the reward function of optimizing agents by observing their responses (estimates or actions). This paper considers IRL when noisy estimates of the gradient of a reward function, generated by multiple stochastic gradient agents, are observed. We present a generalized Langevin dynamics algorithm to estimate the reward function $R(\theta)$; specifically, the resulting Langevin algorithm asymptotically generates samples from a distribution proportional to $\exp(R(\theta))$. The proposed IRL algorithms use kernel-based passive learning schemes. We also construct multi-kernel passive Langevin algorithms for IRL that are suitable for high-dimensional data. The performance of the proposed IRL algorithms is illustrated on examples in adaptive Bayesian learning, logistic regression (a high-dimensional problem), and constrained Markov decision processes. We prove weak convergence of the proposed IRL algorithms using martingale averaging methods. We also analyze the tracking performance of the IRL algorithms in non-stationary environments, where the utility function $R(\theta)$ jump changes over time according to a slow Markov chain.
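To make the sampling idea concrete, the sketch below shows a standard (unadjusted) Langevin algorithm that generates samples from a density proportional to $\exp(R(\theta))$ when the gradient of $R$ is available. This is only an illustration of the underlying Langevin recursion, not the paper's passive kernel-based IRL algorithm, which instead works from noisy gradient estimates observed at points chosen by the external agents; the function names (`grad_R`, `langevin_sampler`) and parameter values are illustrative assumptions.

```python
import numpy as np

def langevin_sampler(grad_R, theta0, step=1e-2, n_iter=10_000, rng=None):
    """Unadjusted Langevin algorithm.

    Iterates theta_{k+1} = theta_k + (step/2) * grad_R(theta_k) + sqrt(step) * w_k,
    with w_k standard Gaussian noise; as step -> 0 and n_iter -> infinity the
    iterates sample (approximately) from a density proportional to exp(R(theta)).
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))
    for k in range(n_iter):
        noise = rng.standard_normal(theta.size)
        theta = theta + 0.5 * step * grad_R(theta) + np.sqrt(step) * noise
        samples[k] = theta
    return samples

# Toy check: R(theta) = -0.5 * ||theta||^2, so exp(R) is a standard Gaussian;
# after burn-in the samples should have mean ~0 and variance ~1 per coordinate.
if __name__ == "__main__":
    samples = langevin_sampler(lambda th: -th, theta0=np.zeros(2))
    print(samples[2000:].mean(axis=0), samples[2000:].var(axis=0))
```

In the passive setting studied in the paper, the algorithm cannot query `grad_R` at its own iterates; the kernel-based schemes weight the agents' observed gradient estimates by how close their evaluation points are to the current iterate, which is what the fully observed call above abstracts away.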