PIUMA: Programmable Integrated Unified Memory Architecture

Paper title

PIUMA: Programmable Integrated Unified Memory Architecture

Authors

Aananthakrishnan, Sriram; Abedin, Shamsul; Cave, Vincent; Checconi, Fabio; Du Bois, Kristof; Eyerman, Stijn; Fryman, Joshua B.; Heirman, Wim; Howard, Jason; Hur, Ibrahim; Jain, Samkit; Landowski, Marek M.; Ma, Kevin; Nelson, Jarrod; Pawlowski, Robert; Petrini, Fabrizio; Szkoda, Sebastian; Tayal, Sanjaya; Tithi, Jesmin Jahan; Vandriessche, Yves

Abstract

High-performance, large-scale graph analytics are essential for the timely analysis of relationships in big data sets. Conventional processor architectures suffer from inefficient resource usage and poor scaling on these workloads. To enable efficient and scalable graph analysis, Intel developed the Programmable Integrated Unified Memory Architecture (PIUMA) as part of the DARPA Hierarchical Identify Verify Exploit (HIVE) program. PIUMA consists of many multi-threaded cores, fine-grained memory and network accesses, a globally shared address space, powerful offload engines, and a tightly integrated optical interconnection network. By using co-packaged silicon photonics and extending the on-chip mesh protocol directly to the optical fabric, all PIUMA chips in a system are glued together into one large virtual die, which allows for extremely low socket-to-socket latencies even as the system scales to thousands of sockets. Performance estimates project that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude. Furthermore, PIUMA continues to scale across multiple nodes, which is a challenge in conventional multi-node setups. This paper presents the PIUMA architecture and documents our experience in designing and building a prototype chip and its bring-up process. We summarize the methodology for co-designing the architecture together with the software stack using simulation tools and FPGA emulation. These tools provided early performance estimates for realistic applications and allowed us to implement many optimizations across the hardware, compilers, libraries, and applications. We built the PIUMA chip as a 316 mm² 7 nm FinFET CMOS die and constructed a 16-node system. PIUMA silicon has successfully powered on, demonstrating key aspects of the architecture, some of which will be incorporated into future Intel products.
