Paper title
Toward Discovering Options that Achieve Faster Planning
Paper authors
Paper abstract
We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. In a sequential machine, the speed of planning is proportional to the number of elementary operations used to achieve a good policy. For episodic tasks, the number of elementary operations depends on the number of options composed by the policy in an episode and the number of options being considered at each decision point. To reduce the amount of computation in planning, for a given set of episodic tasks and a given number of options, our objective prefers options with which it is possible to achieve a high return by composing few options, and also prefers a smaller set of options to choose from at each decision point. We develop an algorithm that optimizes the proposed objective. In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with a smaller number of elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value that matches that achieved by two human-designed options, 3) the amount of computation used by option-value iteration with options discovered by our algorithm matches that with the human-designed options, and 4) the options produced by our algorithm also make intuitive sense: they seem to move to and terminate at the entrances of rooms.
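The trade-off described in the abstract, between how many options a policy composes per episode and how many options are considered at each decision point, can be illustrated with a toy sketch. This is not the paper's algorithm or its four-room domain. It assumes a hypothetical one-dimensional corridor with the goal at the right end, deterministic option models, a discount of 0.9, and it treats one (state, option) backup as one "elementary operation". The names `make_options`, `plan`, and `to_goal` are made up for this illustration.

```python
GAMMA = 0.9  # assumed discount factor for this toy example


def make_options(n_states, with_shortcut):
    """Build deterministic option models for a corridor of n_states cells.

    Each model maps a state to (reward, discount, next_state).
    """
    goal = n_states - 1

    def left(s):
        # Primitive action: one step left, no reward.
        return 0.0, GAMMA, max(s - 1, 0)

    def right(s):
        # Primitive action: one step right, reward 1 on entering the goal.
        nxt = s + 1
        return (1.0 if nxt == goal else 0.0), GAMMA, nxt

    def to_goal(s):
        # Hypothetical temporally extended option: walk right until the goal.
        steps = goal - s
        return GAMMA ** (steps - 1), GAMMA ** steps, goal

    return [left, right] + ([to_goal] if with_shortcut else [])


def plan(n_states, options, tol=1e-9):
    """Synchronous value iteration over option models.

    Returns (values, backups), where backups counts (state, option) updates.
    """
    goal = n_states - 1
    v = [0.0] * n_states  # the goal is terminal, so its value stays 0
    backups = 0
    while True:
        new_v = v[:]
        for s in range(goal):
            best = float("-inf")
            for model in options:
                reward, discount, nxt = model(s)
                best = max(best, reward + discount * v[nxt])
                backups += 1
            new_v[s] = best
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v, backups
        v = new_v


values_prim, backups_prim = plan(10, make_options(10, with_shortcut=False))
values_opt, backups_opt = plan(10, make_options(10, with_shortcut=True))
print(backups_prim, backups_opt)  # 180 54
```

With primitives only, each sweep is cheaper (two options per state) but value information moves one cell per sweep, so ten sweeps are needed. Adding the single long-range option makes each sweep cost more (three options per state) yet planning finishes in two sweeps, because one option suffices to reach the goal from any state. Both runs arrive at the same value function.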