针对路由问题的样本高效，基于探索的政策优化

论文标题

针对路由问题的样本高效，基于探索的政策优化

Sample-Efficient, Exploration-Based Policy Optimisation for Routing Problems

论文作者

Sultana, Nasrin, Chan, Jeffrey, Sarwar, Tabinda, Qin, A. K.

论文摘要

基于无模型的深层学习算法已应用于一系列COPS〜 \ CITE {BELLO2016neural}〜\ cite {Kool2018 antate}〜\ cite {nazari20182018 Reininforcement}。但是，当应用于组合问题时，这些方法面临两个关键挑战：探索不足和许多搜索空间的培训示例的要求以实现合理的绩效。组合优化可以很复杂，其特征是搜索空间，具有许多Optimas和较大的搜索和学习空间。因此，需要一种新的方法来找到通过更有效的样本效率来更有效的良好解决方案。本文提出了一种基于熵的新增强学习方法。此外，我们设计了一种基于政策的增强学习技术，可最大程度地提高预期回报，并提高样本效率，以在训练时间内实现更快的学习。我们会系统地评估通常用于评估基于学习的优化的一系列路线优化任务，例如旅行推销员问题（TSP），电容性车辆路由问题（CVRP）。在本文中，我们表明我们的模型可以概括为各种路线问题，例如分配交换VRP（SDVRP），并将我们方法的性能与当前最新方法的性能进行比较。经验结果表明，提出的方法可以在解决方案质量和计算时间方面改善最先进的方法，并推广到不同尺寸的问题。

Model-free deep-reinforcement-based learning algorithms have been applied to a range of COPs~\cite{bello2016neural}~\cite{kool2018attention}~\cite{nazari2018reinforcement}. However, these approaches suffer from two key challenges when applied to combinatorial problems: insufficient exploration and the requirement of many training examples of the search space to achieve reasonable performance. Combinatorial optimisation can be complex, characterised by search spaces with many optimas and large spaces to search and learn. Therefore, a new method is needed to find good solutions that are more efficient by being more sample efficient. This paper presents a new reinforcement learning approach that is based on entropy. In addition, we design an off-policy-based reinforcement learning technique that maximises the expected return and improves the sample efficiency to achieve faster learning during training time. We systematically evaluate our approach on a range of route optimisation tasks typically used to evaluate learning-based optimisation, such as the such as the Travelling Salesman problems (TSP), Capacitated Vehicle Routing Problem (CVRP). In this paper, we show that our model can generalise to various route problems, such as the split-delivery VRP (SDVRP), and compare the performance of our method with that of current state-of-the-art approaches. The Empirical results show that the proposed method can improve on state-of-the-art methods in terms of solution quality and computation time and generalise to problems of different sizes.

下载PDF全文

下载文献需遵守相关版权规定

论文标题