Paper Title
How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention
Paper Authors
Paper Abstract
Recent research on the multi-head attention mechanism, especially in pre-trained models such as BERT, has offered heuristics and clues for analyzing various aspects of the mechanism. Since most of this research focuses on probing tasks or hidden states, previous work has identified some primitive patterns of attention-head behavior through heuristic analytical methods, but a more systematic analysis dedicated to the attention patterns themselves remains rudimentary. In this work, we cluster the attention heatmaps into significantly distinct patterns through unsupervised clustering on top of a set of proposed features, which corroborates previous observations. We further study the corresponding functions of these patterns through analytical experiments. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.
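As a rough illustration of the pipeline the abstract describes, the sketch below extracts per-head attention maps from bert-base-uncased via Hugging Face transformers, computes two simple distance-based features per head, and clusters the heads with k-means. The specific features (mean attended distance, attention entropy) and the cluster count are hypothetical stand-ins chosen for illustration; the paper's actual feature set is not detailed in the abstract.

```python
import torch
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 layer tensors, each (batch, heads, seq, seq).
attn = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
layers, heads, seq, _ = attn.shape

# Hypothetical feature 1: mean absolute distance between each query token
# and the positions it attends to, averaged over queries.
positions = torch.arange(seq, dtype=torch.float)
dist = (positions.view(seq, 1) - positions.view(1, seq)).abs()  # (seq, seq)
mean_dist = (attn * dist).sum(-1).mean(-1)                      # (layers, heads)

# Hypothetical feature 2: entropy of each attention distribution,
# averaged over queries (low entropy = sharply focused head).
entropy = -(attn * (attn + 1e-9).log()).sum(-1).mean(-1)        # (layers, heads)

# One feature vector per head, clustered into an assumed 5 pattern groups.
features = torch.stack([mean_dist, entropy], dim=-1).reshape(-1, 2).numpy()
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
print(labels.reshape(layers, heads))  # cluster id for each of the 144 heads
```

In this setup, heads with a small mean attended distance and low entropy would fall into "local" clusters (e.g., attending to adjacent tokens), while heads with a large mean distance or high entropy would land in broader, more diffuse clusters; averaging the features over a corpus of sentences, rather than a single example, would make the cluster assignments more stable.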