双重同步价值迭代：在动作中使价值迭代异步

论文标题

双重同步价值迭代：在动作中使价值迭代异步

Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions

论文作者

Tian, Tian, Young, Kenny, Sutton, Richard S.

论文摘要

价值迭代（VI）是一种基础动态编程方法，对于最佳控制和强化学习的学习和计划很重要。 VI分批进行，其中必须完成对每个状态值的更新，然后才能开始下一批更新。如果状态空间较大，完成单批次的昂贵，那么对于许多应用来说，VI是不切实际的。异步VI通过一次，就地和任意顺序更新一个状态来帮助解决大型状态空间问题。但是，异步VI仍然需要在整个动作空间上最大化，这对于具有较大动作空间的域而言是不切实际的。为了解决这个问题，我们提出了双重同步价值迭代（DAVI），这是一种新算法，概括了从国家到州和行动的异步概念。更具体地说，Davi在一个可以具有任何用户定义尺寸的动作子集上最大化。使用采样来减少计算的这种简单方法使VI具有类似吸引人的理论属性，而无需等待每个更新中的整个动作空间进行全面扫描。在本文中，我们显示了DAVI收敛到概率上的最佳值函数，以接近几何的速率与概率1-Delta收敛，并在计算时间中返回近乎最佳的策略，该策略几乎与先前建立的VI绑定相匹配。我们还经验证明了Davi在几个实验中的有效性。

Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in-place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action space. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains similarly appealing theoretical properties to VI without the need to wait for a full sweep through the entire action space in each update. In this paper, we show DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1-delta, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.

下载PDF全文

下载文献需遵守相关版权规定

论文标题