Paper Title
NPU-Accelerated Imitation Learning for Thermal Optimization of QoS-Constrained Heterogeneous Multi-Cores
Paper Authors
Paper Abstract
Application migration and dynamic voltage and frequency scaling (DVFS) are indispensable means for fully exploiting the available potential in thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets. However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management techniques for power or temperature minimization either rely on measurements that are often not available (such as power) or fail to consider all the dimensions of the problem (e.g., by using simplified analytical models). Imitation learning (IL) makes it possible to exploit the optimality of an oracle policy at low run-time overhead by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training a neural network (NN) and accelerate the NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread on end devices, they are so far only used to accelerate user applications. In contrast, we use an existing accelerator on a real platform to accelerate NN-based resource management. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and an NPU shows significant temperature reductions at a negligible run-time overhead, with unseen applications and different cooling than used for training.
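To make the imitation-learning idea in the abstract concrete, the following is a minimal, self-contained behavioral-cloning sketch: an expensive oracle policy labels states with resource-management actions, and a cheap learned policy is fit to those demonstrations. All names here (the `oracle_policy` rules, the state features, the action set) are illustrative assumptions, not the paper's actual implementation; a 1-nearest-neighbour lookup stands in for the NN classifier that the paper trains and runs on the NPU.

```python
# Hedged sketch of imitation learning (behavioral cloning) for thermal-aware
# core/V/f selection. Everything below is an illustrative toy, not the
# paper's method.
import math
import random

# Hypothetical action set: which cluster to run on, at which V/f level.
ACTIONS = ["big_high_vf", "big_low_vf", "little_high_vf", "little_low_vf"]

def oracle_policy(temp_c, qos_slack):
    """Hypothetical oracle: picks a cluster and V/f level from the current
    temperature and the QoS slack (headroom before a QoS violation)."""
    if qos_slack < 0.1:                 # QoS nearly violated -> fast big core
        return "big_high_vf"
    if temp_c > 70.0:                   # hot -> prefer the LITTLE cluster
        return "little_low_vf" if qos_slack > 0.5 else "little_high_vf"
    return "big_low_vf"                 # cool with slack -> save power on big

# 1) Collect oracle demonstrations: (state, action) pairs.
random.seed(0)
demos = [((random.uniform(40, 90), random.uniform(0, 1)),) for _ in range(500)]
demos = [(s, oracle_policy(*s)) for (s,) in demos]

# 2) "Train" the imitation policy (1-NN over demonstrations).
def imitation_policy(temp_c, qos_slack):
    def dist(demo):
        (t, q), _ = demo
        return math.hypot((t - temp_c) / 50.0, q - qos_slack)  # scaled distance
    return min(demos, key=dist)[1]

# 3) The learned policy should agree with the oracle on most unseen states.
test_states = [(random.uniform(40, 90), random.uniform(0, 1)) for _ in range(200)]
agree = sum(imitation_policy(*s) == oracle_policy(*s) for s in test_states)
print(f"agreement with oracle: {agree / len(test_states):.0%}")
```

The key property this toy shares with the paper's approach is that the oracle is queried only offline to generate training data; at run time, only the cheap learned policy is evaluated.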