Paper Title
A Study of Runtime Adaptive Prefetching for STTRAM L1 Caches
Paper Authors
Paper Abstract
Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAM for on-chip caches due to several advantages, including non-volatility, low leakage, high integration density, and CMOS compatibility. Prior studies have shown that relaxing the STTRAM retention time and adapting it to runtime application needs can substantially reduce overall cache energy without significant latency overheads, thanks to the lower STTRAM write energy and latency at shorter retention times. In this paper, as a first step towards efficient prefetching across the STTRAM cache hierarchy, we study prefetching in reduced-retention STTRAM L1 caches. Using SPEC CPU 2017 benchmarks, we analyze the energy and latency impact of different prefetch distances at different STTRAM cache retention times for different applications. We show that expired_unused_prefetches---the number of unused prefetches expired by the reduced-retention STTRAM cache---can accurately determine the best retention time for energy consumption and access latency. This new metric can also provide insights into the best prefetch distance for memory bandwidth consumption and prefetch accuracy. Based on our analysis and insights, we propose Prefetch-Aware Retention time Tuning (PART) and Retention time-based Prefetch Control (RPC). Compared to a base STTRAM cache, PART and RPC collectively reduced the average cache energy and latency by 22.24% and 24.59%, respectively. When the base architecture was augmented with state-of-the-art near-side prefetch throttling (NST), PART+RPC reduced the average cache energy and latency by 3.50% and 3.59%, respectively, and reduced the hardware overhead by 54.55%.
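To make the role of the expired_unused_prefetches metric concrete, the following is a minimal sketch of how a controller might use it. This is an illustration under assumptions, not the paper's actual PART/RPC implementation: the function names, the per-retention-time counter dictionary, and the throttling threshold are all hypothetical.

```python
def select_retention_time(expired_unused_by_retention):
    """PART-style idea (hypothetical sketch): among candidate retention
    times, pick the one with the fewest expired unused prefetches, since
    the abstract reports that this metric tracks the retention time that
    is best for both energy and access latency."""
    return min(expired_unused_by_retention,
               key=expired_unused_by_retention.get)

def throttle_prefetch_distance(distance, expired_unused, threshold=16):
    """RPC-style idea (hypothetical sketch): when many prefetched blocks
    expire unused, the prefetcher is fetching too far ahead of demand,
    wasting bandwidth and energy, so shrink the prefetch distance;
    otherwise leave it unchanged. The threshold value is an assumption."""
    if expired_unused > threshold and distance > 1:
        return distance - 1
    return distance

# Example: hypothetical counts of expired unused prefetches observed
# at three candidate retention times during a sampling interval.
counts = {"10us": 40, "100us": 12, "1ms": 25}
best = select_retention_time(counts)      # -> "100us"
new_dist = throttle_prefetch_distance(4, expired_unused=32)  # -> 3
```

In hardware, both decisions would be driven by a small per-interval counter incremented whenever a prefetched block's retention timer expires before any demand access touches it.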