Paper Title

Efficient Algorithms for Device Placement of DNN Graph Operators

Authors

Jakub Tarnawski, Amar Phanishayee, Nikhil R. Devanur, Divya Mahajan, Fanny Nina Paravecino

Abstract

Modern machine learning workloads use large models, with complex structures, that are very expensive to execute. The devices that execute complex models are becoming increasingly heterogeneous as we see a flourishing of domain-specific accelerators being offered as hardware accelerators in addition to CPUs. These trends necessitate distributing the workload across multiple devices. Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices. In particular, this form of parallelism assumes a pipeline of devices, which is fed a stream of samples and yields high throughput for training and inference of DNNs. However, for such settings (large models and multiple heterogeneous devices), we require automated algorithms and toolchains that can partition the ML workload across devices. In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings. We then provide algorithms that solve this problem to optimality. We demonstrate the applicability and efficiency of our approaches using several contemporary DNN computation graphs.
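To make the optimization problem concrete, below is a minimal sketch of a heavily simplified variant: splitting a linear chain of DNN operators into contiguous pipeline stages, one per device, so that the slowest stage (the pipeline bottleneck, which bounds steady-state throughput) is as fast as possible. This is not the paper's algorithm, which solves a far more general problem to optimality (general computation graphs, heterogeneous devices, memory constraints); the function name `partition_chain` and the per-operator cost profile `op_costs` are hypothetical illustrations.

```python
# Simplified illustration (not the paper's method): bottleneck-minimizing
# partition of a chain of operators into `num_devices` contiguous stages.
from functools import lru_cache

def partition_chain(op_costs, num_devices):
    """Return the minimum achievable bottleneck stage cost."""
    n = len(op_costs)
    # Prefix sums so any stage cost ops[i:j] is an O(1) lookup.
    prefix = [0.0]
    for c in op_costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Minimum bottleneck when placing ops[i:] on k devices.
        if k == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        # First stage takes ops[i:j]; leave >= k-1 ops for the rest.
        for j in range(i + 1, n - k + 2):
            stage = prefix[j] - prefix[i]
            result = min(result, max(stage, best(j, k - 1)))
        return result

    return best(0, num_devices)

# Example: 8 operators placed on 3 devices; throughput ~ 1 / bottleneck.
print(partition_chain([2.0, 3.0, 1.0, 4.0, 2.0, 2.0, 3.0, 1.0], 3))  # -> 6.0
```

In a pipelined setting fed a stream of samples, throughput is governed by the slowest stage rather than by total latency, which is why the objective above is min-max rather than min-sum; the paper's setting additionally handles non-chain DAG structure and device heterogeneity, which this sketch does not capture.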
