Snitch: A tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Paper title

Snitch: A tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Authors

Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

Abstract

Data-parallel applications, such as data analytics, machine learning, and scientific computing, are placing an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to adjust to algorithmic changes. We propose an architectural concept that tackles the issues of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine. The key idea is to pair a tiny 10kGE control core, called Snitch, with a double-precision FPU to adjust the compute to control ratio. While traditionally minimizing non-FPU area and achieving high floating-point utilization has been a trade-off, with Snitch, we achieve them both, by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP). SSRs allow the core to implicitly encode load/store instructions as register reads/writes, eliding many explicit memory instructions. The FREP extension decouples the floating-point and integer pipeline by sequencing instructions from a micro-loop buffer. These ISA extensions significantly reduce the pressure on the core and free it up for other tasks, making Snitch and FPU effectively dual-issue at a minimal incremental cost of 3.2%. The two low overhead ISA extensions make Snitch more flexible than a contemporary vector processor lane, achieving a $2\times$ energy-efficiency improvement. We have evaluated the proposed core and ISA extensions on an octa-core cluster in 22nm technology. We achieve more than $5\times$ multi-core speed-up and a $3.5\times$ gain in energy efficiency on several parallel microkernels.
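
The effect of the two extensions described in the abstract can be illustrated with a toy model. The sketch below is not the Snitch hardware or its ISA encoding; it only counts how many instructions an integer core would have to issue for a dot product, under assumed per-iteration costs (two loads, one multiply-add, one index increment, one branch) that are illustrative and not taken from the paper. The function names and the use of Python iterators as stand-ins for stream registers are hypothetical, and stream setup cost is ignored.

```python
# Toy model (NOT the Snitch RTL or ISA encoding): count the instructions the
# integer core must issue for an n-element dot product.

def dot_baseline(a, b):
    """Scalar loop: every operand needs an explicit load."""
    acc, issued = 0.0, 0
    for i in range(len(a)):
        x = a[i]
        issued += 1              # explicit load of a[i]
        y = b[i]
        issued += 1              # explicit load of b[i]
        acc += x * y
        issued += 1              # multiply-add
        issued += 2              # assumed: index increment + loop branch
    return acc, issued


def dot_ssr_frep(a, b):
    """Stream-register style: reading a stream yields the next element."""
    stream0, stream1 = iter(a), iter(b)   # stand-ins for two configured streams
    acc, issued = 0.0, 0
    issued += 1                  # assumed: a single repetition instruction
    for _ in range(len(a)):      # repetitions replayed from the loop buffer
        # Register reads act as implicit loads; no core instruction is issued.
        acc += next(stream0) * next(stream1)
    return acc, issued


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    print(dot_baseline(a, b))
    print(dot_ssr_frep(a, b))
```

In this model the baseline issue count grows with the vector length, while the stream-based variant stays constant, which is the intuition behind the core being freed up for other tasks while the FPU stays busy.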

Paper title