Paper Title
Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
Paper Authors
Paper Abstract
Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with its environment in sequential decision-making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings, where the number of decision points diverges to infinity. We propose to model the state-action value function (Q-function) associated with a policy using the series/sieve method and to derive its confidence interval. When the target policy itself depends on the observed data, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even when the optimal policy is not unique. Simulation studies are conducted to support our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status. A Python implementation of the proposed procedure is available at https://github.com/shengzhang37/SAVE.
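To make the sieve-based evaluation step concrete, the sketch below illustrates how a Q-function might be estimated with a linear basis (sieve) expansion under the Bellman equation, followed by a plug-in confidence interval for the value at an initial state. This is a minimal hypothetical illustration on simulated data, not the authors' SAVE implementation; the polynomial basis, toy MDP, and target policy here are assumptions, and the actual procedure is in the linked repository.

```python
# Hypothetical sketch: sieve (linear basis) evaluation of a fixed policy's
# Q-function via the Bellman equation, with a plug-in CI for the value at s0.
# Not the authors' SAVE implementation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gamma = 0.9          # discount factor
n_traj, T = 50, 100  # number of trajectories and decision points

def basis(s, a, n_actions=2, degree=3):
    """Polynomial sieve basis for (state, action): one block per action."""
    feats = np.array([s**k for k in range(degree + 1)])
    phi = np.zeros(n_actions * (degree + 1))
    phi[a * (degree + 1):(a + 1) * (degree + 1)] = feats
    return phi

def target_policy(s):
    """Fixed policy to evaluate (assumed known here)."""
    return int(s > 0.0)

# Simulate a toy batch of trajectories under a uniform behaviour policy.
rows = []
for _ in range(n_traj):
    s = rng.normal()
    for _ in range(T):
        a = rng.integers(2)
        r = 0.5 * s + a * s - 0.25 + rng.normal(scale=0.1)
        s_next = 0.7 * s + 0.3 * (2 * a - 1) + rng.normal(scale=0.5)
        rows.append((s, a, r, s_next))
        s = s_next

# LSTD-Q style sieve estimator: solve A beta = b, where
# A = sum phi(s,a) [phi(s,a) - gamma * phi(s', pi(s'))]^T and b = sum phi(s,a) r.
p = basis(0.0, 0).size
A, b = np.zeros((p, p)), np.zeros(p)
cache = []
for s, a, r, s_next in rows:
    phi = basis(s, a)
    phi_next = basis(s_next, target_policy(s_next))
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r
    cache.append((phi, phi_next, r))
beta = np.linalg.solve(A, b)

# Plug-in sandwich variance for the value at an initial state s0 under pi.
s0 = 0.0
u = basis(s0, target_policy(s0))                  # value estimate is u^T beta
Omega = np.zeros((p, p))
for phi, phi_next, r in cache:
    eps = r + gamma * phi_next @ beta - phi @ beta  # temporal-difference residual
    Omega += np.outer(phi, phi) * eps**2
A_inv = np.linalg.inv(A)
var = u @ A_inv @ Omega @ A_inv.T @ u
value, half_width = u @ beta, stats.norm.ppf(0.975) * np.sqrt(var)
print(f"Estimated value at s0: {value:.3f}  95% CI: "
      f"[{value - half_width:.3f}, {value + half_width:.3f}]")
```

In this sketch the CI is driven by the temporal-difference residuals, mirroring the idea that the value estimator remains valid whether the number of trajectories or the number of decision points grows; the SAVE procedure additionally updates the estimated policy and its value estimator recursively when the target policy is learned from the same data.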