Paper Title
Scheduling for Mobile Edge Computing with Random User Arrivals: An Approximate MDP and Reinforcement Learning Approach
Paper Authors
Paper Abstract
In this paper, we investigate the scheduling design of a mobile edge computing (MEC) system, where active mobile devices with computation tasks randomly appear in a cell. Every task can be computed at either the mobile device or the MEC server. We jointly optimize the task offloading decision, uplink transmission device selection, and power allocation by formulating the problem as an infinite-horizon Markov decision process (MDP). In contrast to most of the existing literature, this is, to the best of our knowledge, the first attempt to address transmission and computation optimization with random device arrivals over an infinite time horizon. Due to the uncertainty in the number and locations of devices, conventional approximate MDP approaches for addressing the curse of dimensionality cannot be applied, and an alternative low-complexity solution framework is proposed in this work. We first introduce a baseline scheduling policy, whose value function can be derived analytically from the statistics of random mobile device arrivals. Then, one-step policy iteration is adopted to obtain a sub-optimal scheduling policy whose performance can be bounded analytically. By eliminating the complicated value iteration, the complexity of deriving the sub-optimal policy is reduced dramatically compared with conventional MDP solutions. To address a more general scenario where the statistics of random mobile device arrivals are unknown, a novel and efficient algorithm integrating reinforcement learning and stochastic gradient descent (SGD) is proposed to improve the system performance in an online manner. Simulation results show that the gain of the sub-optimal policy over various benchmarks is significant.
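As a concrete illustration of the one-step policy iteration described in the abstract, the following minimal sketch applies it to a toy finite MDP. Everything here is a hypothetical stand-in: the state/action sizes, the random transition kernel P, the cost matrix, the all-zeros baseline policy, and the discount factor gamma are illustrative assumptions, not the paper's actual MEC scheduling model.

```python
import numpy as np

# Toy problem sizes; the paper's real state space (device numbers,
# locations, task states) is far larger and partly continuous.
n_states, n_actions = 5, 3
gamma = 0.9  # discount factor of the infinite-horizon MDP

rng = np.random.default_rng(0)

# Hypothetical transition kernel P[a, s, s'] and per-stage cost c[s, a];
# in the paper these would follow from the MEC scheduling model.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)  # normalize each row to a distribution
cost = rng.random((n_states, n_actions))

# Baseline policy: one fixed action per state, standing in for the
# analytically tractable baseline scheduling policy.
baseline = np.zeros(n_states, dtype=int)

# Exact policy evaluation: solve (I - gamma * P_pi) V = c_pi. This
# mirrors deriving the baseline value function in closed form.
P_pi = P[baseline, np.arange(n_states), :]
c_pi = cost[np.arange(n_states), baseline]
V_baseline = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)

# One-step policy improvement: in each state, pick the action that
# minimizes the one-stage cost plus the discounted baseline value-to-go.
Q = cost + gamma * np.einsum('ast,t->sa', P, V_baseline)
improved_policy = Q.argmin(axis=1)
print("improved policy:", improved_policy)
```

The design point mirrored here is the one the abstract highlights: the costly fixed-point value iteration is replaced by a single exact policy evaluation (a linear solve for the baseline value function) followed by one greedy improvement step.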