论文标题
稀疏的注意力加速与协同的内存修剪和片上重新组件
Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation
论文作者
论文摘要
作为其核心计算,一种自我发场机制可以在整个输入序列中分配成对的相关性。尽管表现良好,但计算成对相关性的成本高昂。尽管最近的工作表明了注意力分数低的元素的运行时修剪的好处,但自我发挥机制的二次复杂性及其芯片内存能力的需求被忽略了。这项工作通过构建一个称为Sprint的加速器来解决这些约束,该加速器利用了RERAM横杆阵列的固有并行性以近似方式计算注意力分数。我们的设计使用RERAM内的轻质模拟阈值电路来降低注意力评分,从而使Sprint只能获取一小部分相关数据以进行芯片内存。为了减轻模型准确性的潜在负面影响,Sprint重新计算数字中少数获取数据的注意力评分。相关注意力分数的组合内置上的修剪和片上重新计算,Sprint可以将二次复杂性转化为仅线性的复杂性。此外,我们即使修剪后,我们也可以识别并利用相邻注意操作之间的动态空间位置,从而消除了昂贵但冗余的数据获取。我们在各种最新的变压器模型上评估了我们提出的技术。平均而言,当使用总16KB芯片内存时,Sprint会产生7.5倍的速度和19.6倍的能量,而实际上与基线模型的ISO临界率相当(平均为0.36%的降级)。
As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called SPRINT, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using a lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, SPRINT re-computes the attention scores for the few fetched data in digital. The combined in-memory pruning and on-chip recompute of the relevant attention scores enables SPRINT to transform quadratic complexity to a merely linear one. In addition, we identify and leverage a dynamic spatial locality between the adjacent attention operations even after pruning, which eliminates costly yet redundant data fetches. We evaluate our proposed technique on a wide range of state-of-the-art transformer models. On average, SPRINT yields 7.5x speedup and 19.6x energy reduction when total 16KB on-chip memory is used, while virtually on par with iso-accuracy of the baseline models (on average 0.36% degradation).