Dustin: A 16-Cores Parallel Ultra-Low-Power Cluster with 2b-to-32b Fully Flexible Bit-Precision and Vector Lockstep Execution Mode

Paper title

Dustin: A 16-Cores Parallel Ultra-Low-Power Cluster with 2b-to-32b Fully Flexible Bit-Precision and Vector Lockstep Execution Mode

Authors

Gianmarco Ottavi, Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Alfio Di Mauro, Luca Benini, Davide Rossi

Abstract

Computationally intensive algorithms such as Deep Neural Networks (DNNs) are becoming killer applications for edge devices. Porting heavily data-parallel algorithms on resource-constrained and battery-powered devices poses several challenges related to memory footprint, computational throughput, and energy efficiency. Low-bitwidth and mixed-precision arithmetic have been proven to be valid strategies for tackling these problems. We present Dustin, a fully programmable compute cluster integrating 16 RISC-V cores capable of 2- to 32-bit arithmetic and all possible mixed-precision permutations. In addition to a conventional Multiple-Instruction Multiple-Data (MIMD) processing paradigm, Dustin introduces a Vector Lockstep Execution Mode (VLEM) to minimize power consumption in highly data-parallel kernels. In VLEM, a single leader core fetches instructions and broadcasts them to the 15 follower cores. Clock gating Instruction Fetch (IF) stages and private caches of the follower cores leads to 38% power reduction with minimal performance overhead (<3%). The cluster, implemented in 65 nm CMOS technology, achieves a peak performance of 58 GOPS and a peak efficiency of 1.15 TOPS/W.
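To make the idea of mixed-precision arithmetic on packed operands concrete, the following is a minimal software sketch of a dot product between 8-bit activations and 4-bit weights, each packed into 32-bit words and accumulated at 32-bit precision. It only models the arithmetic; it does not reproduce Dustin's instruction set or hardware datapath. The function names (pack_signed, unpack_signed, mixed_dot) and the lowest-lane-first packing order are assumptions made for illustration.

```python
def pack_signed(values, bits):
    """Pack signed integers of width `bits` into one 32-bit word, lowest lane first.

    Hypothetical helper: the lane order is an assumption, not taken from the paper.
    """
    assert 32 % bits == 0 and len(values) == 32 // bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    word = 0
    for lane, v in enumerate(values):
        assert lo <= v <= hi, "value does not fit in the lane width"
        word |= (v & ((1 << bits) - 1)) << (lane * bits)
    return word


def unpack_signed(word, bits):
    """Split a 32-bit word into sign-extended lanes of width `bits`."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    lanes = []
    for lane in range(32 // bits):
        raw = (word >> (lane * bits)) & mask
        lanes.append(raw - (1 << bits) if raw & sign else raw)
    return lanes


def mixed_dot(act_words, wgt_word, act_bits, wgt_bits, acc=0):
    """Dot product of packed activations with one word of narrower packed weights.

    One weight word holds more elements than one activation word, so several
    activation words are consumed per weight word. The result wraps to a
    signed 32-bit accumulator.
    """
    weights = unpack_signed(wgt_word, wgt_bits)
    acts = [a for w in act_words for a in unpack_signed(w, act_bits)]
    assert len(acts) == len(weights), "operand element counts must match"
    for a, w in zip(acts, weights):
        acc += a * w
    acc &= 0xFFFFFFFF
    return acc - (1 << 32) if acc & 0x80000000 else acc


# Eight 4-bit weights fit in one word; eight 8-bit activations need two words.
weights = [1, -2, 3, -4, 5, -6, 7, -8]
acts = [10, 20, 30, 40, 50, 60, 70, 80]
wgt_word = pack_signed(weights, 4)
act_words = [pack_signed(acts[:4], 8), pack_signed(acts[4:], 8)]
result = mixed_dot(act_words, wgt_word, act_bits=8, wgt_bits=4)
```

With these inputs, result equals the plain integer dot product of the two lists. Narrower weights mean one 32-bit word carries more elements, which is the source of the memory-footprint and throughput benefit the abstract attributes to low-bitwidth formats.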
