Paper Title
DRACO: Co-Optimizing Hardware Utilization, and Performance of DNNs on Systolic Accelerator
Paper Authors
Paper Abstract
The number of processing elements (PEs) in a fixed-size systolic accelerator is well matched to large, compute-bound DNNs; memory-bound DNNs, however, suffer from PE underutilization and fail to achieve peak performance and energy efficiency. To mitigate this, specialized dataflow and/or micro-architectural techniques have been proposed. However, given their long development cycles and the rapid pace of evolution in the deep learning field, these hardware-based solutions can become obsolete and ineffective at dealing with PE underutilization for state-of-the-art DNNs. In this work, we address the challenge of PE underutilization on the algorithm front and propose data-reuse-aware co-optimization (DRACO), which improves the PE utilization of memory-bound DNNs without requiring any dataflow or micro-architecture modifications. Furthermore, unlike previous co-optimization methods, DRACO not only maximizes performance and energy efficiency but also improves the predictive performance of DNNs. To the best of our knowledge, DRACO is the first work that resolves the resource-underutilization challenge at the algorithm level and demonstrates a trade-off between computational efficiency, PE utilization, and the predictive performance of DNNs. Compared to the state-of-the-art row-stationary dataflow, DRACO achieves 41.8% and 42.6% improvements in average PE utilization and inference latency, respectively, with negligible loss in predictive performance for MobileNetV1 on a $64\times64$ systolic array. DRACO provides seminal insights for utilization-aware DNN design methodologies that can fully leverage the computational power of systolic-array-based hardware accelerators.