Interval Markov Decision Processes with Continuous Action-Spaces

Paper Title

Interval Markov Decision Processes with Continuous Action-Spaces

Authors

Giannis Delimpaltadakis, Morteza Lahijanian, Manuel Mazo Jr., Luca Laurenti

Abstract

Interval Markov Decision Processes (IMDPs) are finite-state uncertain Markov models, where the transition probabilities belong to intervals. Recently, there has been a surge of research on employing IMDPs as abstractions of stochastic systems for control synthesis. However, due to the absence of algorithms for synthesis over IMDPs with continuous action-spaces, the action-space is assumed discrete a-priori, which is a restrictive assumption for many applications. Motivated by this, we introduce continuous-action IMDPs (caIMDPs), where the bounds on transition probabilities are functions of the action variables, and study value iteration for maximizing expected cumulative rewards. Specifically, we decompose the max-min problem associated to value iteration to $|\mathcal{Q}|$ max problems, where $|\mathcal{Q}|$ is the number of states of the caIMDP. Then, exploiting the simple form of these max problems, we identify cases where value iteration over caIMDPs can be solved efficiently (e.g., with linear or convex programming). We also gain other interesting insights: e.g., in certain cases where the action set $\mathcal{A}$ is a polytope, synthesis over a discrete-action IMDP, where the actions are the vertices of $\mathcal{A}$, is sufficient for optimality. We demonstrate our results on a numerical example. Finally, we include a short discussion on employing caIMDPs as abstractions for control synthesis.
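
To make the max-min structure of value iteration concrete, the following is a minimal sketch of robust value iteration over a discrete-action IMDP. It is not the paper's caIMDP algorithm: it covers only the discrete-action setting that the abstract says is sufficient in certain cases where the action set is a polytope and the actions are its vertices. The function names, the array layout, and the discount factor `gamma` are assumptions introduced for illustration. The inner minimization uses the well-known greedy ordering argument for interval uncertainty: start from the lower bounds and push the remaining probability mass onto the lowest-valued successor states first.

```python
import numpy as np


def worst_case_distribution(lower, upper, values):
    """Inner min: choose p with lower <= p <= upper and sum(p) == 1
    that minimizes p @ values (assumes sum(lower) <= 1 <= sum(upper))."""
    p = lower.astype(float)
    budget = 1.0 - p.sum()  # probability mass still to be assigned
    for j in np.argsort(values):  # lowest-valued successors first
        add = min(upper[j] - lower[j], budget)
        p[j] += add
        budget -= add
    return p


def robust_value_iteration(lower, upper, rewards, gamma=0.9,
                           tol=1e-9, max_iter=10_000):
    """Max-min value iteration for a discrete-action IMDP.

    lower, upper: shape (n_states, n_actions, n_states), interval bounds
    rewards:      shape (n_states, n_actions)
    gamma:        discount factor (an assumption of this sketch)
    Returns the value vector and a greedy policy (one action per state).
    """
    n_states, n_actions, _ = lower.shape
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        for s in range(n_states):
            for a in range(n_actions):
                # Adversary picks the worst feasible transition distribution.
                p = worst_case_distribution(lower[s, a], upper[s, a], v)
                q[s, a] = rewards[s, a] + gamma * (p @ v)
        v_new = q.max(axis=1)  # controller maximizes over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
    return v, q.argmax(axis=1)
```

With degenerate intervals (`lower == upper`) the sketch reduces to ordinary value iteration for an MDP, which is a convenient sanity check before trying genuine interval bounds.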
