Paper Title

An Exploratory Study on Code Attention in BERT

Authors

Rishab Sharma, Fuxiang Chen, Fatemeh Fard, David Lo

Abstract

Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use Transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve state-of-the-art results in many downstream tasks such as code summarization and bug detection, they are based on the Transformer and PLMs, which are mainly studied in the Natural Language Processing (NLP) field. Current studies rely on the reasoning and practices from NLP for these models on code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLMs on code and compare it with natural language. We pre-trained BERT, a Transformer-based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We ran several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token, [CLS], in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% in F1-score in its lower and upper layers, respectively. When identifiers' embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21-24% in the F1-score of clone detection. These findings can benefit the research community by using code-specific representations instead of the common embeddings used in NLP, and open new directions for developing smaller models with similar performance.
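To make the central idea concrete, below is a minimal sketch (not the authors' exact pipeline) of the contrast the abstract describes: representing a code snippet with the [CLS] embedding versus pooling the embeddings of its identifier tokens, using the public CodeBERT checkpoint. The identifier set and the naive token matching here are illustrative assumptions; the paper derives identifiers from the code's syntax, and its aggregation strategy may differ.

```python
# Sketch: [CLS]-based vs. identifier-based sequence embeddings for code.
# Assumptions: identifiers are given as a hard-coded set (in practice they
# would come from a parser/AST), and identifier embeddings are mean-pooled.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "microsoft/codebert-base"  # code-based PLM named in the abstract
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

code = "def add(a, b): return a + b"
identifiers = {"add", "a", "b"}  # hypothetical: assumed known from a parser

inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)

# Common NLP practice: represent the whole sequence with the [CLS]/<s> token.
cls_embedding = hidden[0]

# Alternative suggested by the study: pool the identifier-token embeddings.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
id_positions = [
    i for i, tok in enumerate(tokens)
    if tok.lstrip("Ġ") in identifiers  # strip the RoBERTa-style space marker
]
identifier_embedding = hidden[id_positions].mean(dim=0)

# Either vector can then feed a clone-detection classifier, e.g. by comparing
# the representations of two snippets with cosine similarity.
print(cls_embedding.shape, identifier_embedding.shape)
```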
