Title
On the RTL Implementation of FINN Matrix Vector Compute Unit
Authors
Abstract
FPGA-based accelerators are becoming more popular for deep neural networks due to their ability to scale performance with increasing degrees of specialization, such as dataflow architectures or custom data types. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide a higher level of abstraction compared to register-transfer level (RTL)-based design. HLS offers faster development time, better maintainability, and more flexibility in code exploration when evaluating options for multi-dimensional tensors, convolutional layers, or parallelism. Thus, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml. In this paper, we present an alternative backend RTL library for FINN. We investigate and evaluate, across a spectrum of design dimensions, an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around $15\%$. On the other hand, HLS consistently requires more flip-flops (FFs) (an orders-of-magnitude increase) and block RAMs (BRAMs) ($2\times$ more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to $80\%$. Furthermore, RTL benefits from at least a $10\times$ reduction in synthesis time. Finally, the results were practically validated using a real-world use case: a multi-layer perceptron (MLP) network used in network intrusion detection. Overall, since HLS frameworks code-generate the hardware design, the benefit of ease of design entry is less important than the reduction in synthesis time together with the resource savings, which might make the RTL abstraction an attractive alternative.