Paper Title
Puppeteer: A Random Forest-based Manager for Hardware Prefetchers across the Memory Hierarchy
Authors
Abstract
Over the years, processor throughput has steadily increased. However, memory throughput has not increased at the same rate, leading to the memory wall problem, which in turn has widened the gap between effective and theoretical peak processor performance. To cope with this, there has been an abundance of work on data/instruction prefetcher designs. Broadly, prefetchers predict future data/instruction address accesses and proactively fetch data/instructions into the memory hierarchy with the goal of lowering data/instruction access latency. To this end, one or more prefetchers are deployed at each level of the memory hierarchy, but typically, each prefetcher is designed in isolation without comprehensively accounting for the other prefetchers in the system. As a result, individual prefetchers do not always complement each other, which leads to lower average performance gains and/or many negative outliers. In this work, we propose Puppeteer, a hardware prefetcher manager that uses a suite of random forest regressors to determine at runtime which prefetcher should be ON at each level of the memory hierarchy, such that the prefetchers complement each other and reduce data/instruction access latency. Compared to a design with no prefetchers, using Puppeteer we improve IPC by 46.0% in 1-core (1C), 25.8% in 4-core (4C), and 11.9% in 8-core (8C) processors on average across traces generated from the SPEC2017, SPEC2006, and Cloud suites, with ~10KB overhead. Moreover, we also reduce the number of negative outliers by over 89%, and the performance loss of the worst-case negative outlier from 25% to only 5%, compared to the state-of-the-art.
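The core idea, per the abstract, is to train per-configuration regressors and, at runtime, enable the prefetcher configuration whose predicted performance is highest. The following is a minimal illustrative sketch of that selection loop, not Puppeteer's actual hardware implementation: the feature set, configuration names, and synthetic training data are all assumptions made for the example.

```python
# Hypothetical sketch of per-level prefetcher selection with random forests.
# One regressor per candidate configuration predicts expected IPC from
# hardware-event features; the manager enables the argmax configuration.
# All names, features, and data below are illustrative, not from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Candidate prefetcher configurations at one cache level (illustrative).
CONFIGS = ["off", "next_line", "stride", "stream"]

# Synthetic training set: 9 hardware-event features (e.g. cache miss
# rates, branch stats) mapped to an observed IPC under each configuration.
X = rng.random((200, 9))
models = {}
for i, cfg in enumerate(CONFIGS):
    y = X @ rng.random(9) + 0.1 * i  # fake "measured IPC" for this config
    models[cfg] = RandomForestRegressor(
        n_estimators=10, random_state=0
    ).fit(X, y)

def choose_prefetcher(event_counters):
    """Return the configuration with the highest predicted IPC."""
    preds = {
        cfg: m.predict(event_counters.reshape(1, -1))[0]
        for cfg, m in models.items()
    }
    return max(preds, key=preds.get)

# At each program phase, feed current event counters to the manager.
print(choose_prefetcher(rng.random(9)))
```

In a hardware setting the regressors would be small fixed-point trees evaluated from performance-counter values, which is consistent with the ~10KB storage overhead the abstract reports.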