Paper Title
Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters
Paper Authors
Paper Abstract
The steeply growing performance demands of highly power- and energy-constrained processing systems, such as end-nodes of the Internet of Things (IoT), have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance on the order of GOPS and over 100 GOPS/W of energy efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the cluster. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this work, we describe a lightweight hardware-accelerated synchronization and communication unit (SCU) for tightly coupled clusters of processors. We detail the architecture, which enables fine-grained per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22nm FDX technology and evaluated performance and energy efficiency with tunable microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41 times smaller than the baseline implementation based on fast test-and-set access to L1 memory, when constraining the microbenchmarks to 10% synchronization overhead. When evaluated on the real-life DSP applications, the proposed SCU improves performance by up to 92% (23% on average) and energy efficiency by up to 98% (39% on average).
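To make the baseline concrete, the sketch below (not taken from the paper) shows a sense-reversing barrier built on a test-and-set lock, the kind of software primitive the fast test-and-set path to shared L1 memory supports and that the SCU offloads to hardware. The structure name, `barrier_wait`, `NUM_PES`, and the use of C11 atomics are illustrative assumptions; on the actual cluster the lock word would live in the shared L1 banks rather than in ordinary host memory.

```c
/* Illustrative sketch (not the paper's code): a sense-reversing barrier
 * built on a test-and-set lock, i.e. the software baseline the SCU replaces.
 * Uses portable C11 atomics; on the cluster the lock word and counters
 * would reside in shared L1 memory and be reached via test-and-set accesses. */
#include <stdatomic.h>

#define NUM_PES 8  /* eight-core cluster, as in the paper */

typedef struct {
    atomic_flag lock;    /* test-and-set lock word (would sit in shared L1) */
    volatile int count;  /* number of PEs that have arrived at the barrier */
    volatile int sense;  /* global sense, flipped once per barrier episode */
} sw_barrier_t;

static sw_barrier_t bar = { ATOMIC_FLAG_INIT, 0, 0 };

/* Each PE keeps a private sense flag that alternates between episodes. */
void barrier_wait(int *local_sense)
{
    *local_sense = !*local_sense;

    /* Critical section: atomically bump the arrival counter. */
    while (atomic_flag_test_and_set_explicit(&bar.lock, memory_order_acquire))
        ; /* spin: every retry is a read-modify-write on the shared lock word */
    if (++bar.count == NUM_PES) {
        bar.count = 0;             /* last PE resets and releases the others */
        bar.sense = *local_sense;
    }
    atomic_flag_clear_explicit(&bar.lock, memory_order_release);

    /* Busy-wait until the last PE flips the global sense. */
    while (bar.sense != *local_sense)
        ; /* polling loads keep the PE active and load the L1 interconnect */
}
```

Each PE would hold a private `int local_sense = 0;` and call `barrier_wait(&local_sense)` at every barrier. The spin-polling at both the lock and the sense flag is what the hardware SCU avoids: waiting PEs can instead be put to sleep, which is what enables the fine-grained per-PE power management described in the abstract.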