Paper Title

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

Paper Authors

Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin

Paper Abstract

We show in this work that memory intensive computations can result in severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models. For this problem, current just-in-time (JIT) kernel fusion and code generation techniques have limitations, such as rough fusion plan exploration strategies and limited code generation ability. We propose FusionStitching, a deep learning compiler capable of fusing memory intensive operators, with varied data dependencies and non-homogeneous parallelism, into large GPU kernels to reduce global memory access and context switch overhead automatically. FusionStitching widens the range of operation combinations that fusion can target beyond previous JIT works by introducing data reuse of intermediate values. It explores large fusion spaces to decide optimal fusion plans with considerations of memory access costs, kernel calls and resource usage constraints. FusionStitching tunes the optimal stitching scheme with a domain-specific cost model efficiently. Experimental results show that FusionStitching can reach up to 2.21x speedup compared to state-of-the-art, with 1.45x on average. Besides these experimental results, we integrated our approach into a compiler product and deployed it onto a production cluster for AI workloads with thousands of GPUs. The system has been in operation for more than 4 months and saves 7,000 GPU hours on average for approximately 30,000 tasks per month.
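The core idea the abstract describes, fusing several memory-intensive operators into one kernel so intermediate values are reused instead of written back to global memory, can be illustrated with a toy sketch. This is not FusionStitching's actual code generation; it is a minimal, hypothetical 1-D example showing why fusion reduces memory passes.

```python
# Illustrative sketch (hypothetical, not FusionStitching itself):
# fusing elementwise ops removes materialized intermediates,
# analogous to avoiding global memory round-trips between GPU kernels.

def unfused(x, a, b):
    # Each op is a separate "kernel" that reads and writes a
    # full intermediate array (global-memory traffic per step).
    t1 = [xi * a for xi in x]            # kernel 1: multiply
    t2 = [ti + b for ti in t1]           # kernel 2: add
    return [max(ti, 0.0) for ti in t2]   # kernel 3: ReLU

def fused(x, a, b):
    # One fused pass: intermediates live in loop locals
    # (the analogue of registers/shared memory), so the
    # arrays t1 and t2 are never materialized.
    return [max(xi * a + b, 0.0) for xi in x]

if __name__ == "__main__":
    x = [-1.0, 0.5, 2.0]
    # Both versions compute the same result; the fused one
    # touches memory once instead of three times.
    assert unfused(x, 2.0, 1.0) == fused(x, 2.0, 1.0)
    print(fused(x, 2.0, 1.0))
```

The same trade-off the paper's cost model weighs appears even here: fusion saves memory traffic and kernel launches, but on a real GPU it also consumes more registers per thread, which is why the paper searches for an optimal fusion plan under resource constraints.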
