Paper Title
Time-Based Roofline for Deep Learning Performance Analysis
Paper Authors
Paper Abstract
Deep learning applications are usually very compute-intensive and require long run times for training and inference. This has been tackled by researchers on both the hardware and software sides, and in this paper we propose a Roofline-based approach to performance analysis to facilitate the optimization of these applications. The approach is an extension of the Roofline model widely used for traditional high-performance computing applications, and it incorporates both compute/bandwidth complexity and run time in its formulae to provide insights into deep learning-specific characteristics. We take two sets of representative kernels, 2D convolution and long short-term memory (LSTM), to validate and demonstrate the use of this new approach, and investigate how arithmetic intensity, cache locality, auto-tuning, kernel launch overhead, and Tensor Core usage affect performance. Compared to the common ad hoc approach, this study helps form a more systematic way to analyze code performance and identify optimization opportunities for deep learning applications.
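The classical Roofline bound that the paper extends can be sketched in a few lines: a kernel's attainable run time is limited by whichever is larger, its compute time at peak throughput or its memory-traffic time at peak bandwidth, and its arithmetic intensity decides which limit applies. The helper names and the peak numbers below are illustrative assumptions for this sketch, not values taken from the paper.

```python
# A minimal sketch of the Roofline lower bound on kernel run time.
# The peak machine numbers are hypothetical, not from the paper.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of data moved (FLOP/byte)."""
    return flops / bytes_moved

def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on run time: a kernel can be no faster than the
    larger of its compute time and its memory-traffic time."""
    compute_time = flops / peak_flops    # seconds if compute-bound
    memory_time = bytes_moved / peak_bw  # seconds if memory-bound
    return max(compute_time, memory_time)

# Hypothetical accelerator: 15 TFLOP/s peak, 900 GB/s memory bandwidth.
PEAK_FLOPS = 15e12
PEAK_BW = 900e9

# A kernel doing 1 TFLOP over 10 GB of traffic has AI = 100 FLOP/byte,
# above the machine balance (15e12 / 900e9 ~ 16.7 FLOP/byte),
# so its bound comes from the compute time, not the memory time.
t_bound = roofline_time(1e12, 1e10, PEAK_FLOPS, PEAK_BW)
```

Comparing a kernel's measured run time against this bound shows how far it sits below the relevant "roof", which is the kind of gap the paper's time-based formulation is designed to expose.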