Paper Title

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

Authors

Chen Wu, Ming Yan

Abstract


Semantic code search is the task of retrieving a relevant code snippet given a natural language query. Unlike typical information retrieval tasks, code search requires bridging the semantic gap between programming language and natural language to better describe intrinsic concepts and semantics. Recently, deep neural networks for code search have become a hot research topic. Typical neural code search methods first represent the code snippet and the query text as separate embeddings, and then use a vector distance (e.g., dot product or cosine) to compute the semantic similarity between them. There are many ways to aggregate a variable-length sequence of code or query tokens into a learnable embedding, including bi-encoders, cross-encoders, and poly-encoders. The goal of the query encoder and the code encoder is to produce embeddings that are close to each other for a related pair of a query and its desired code snippet, so the choice and design of the encoder are very significant. In this paper, we propose a novel deep semantic model that makes use not only of multi-modal sources but also of feature extractors such as self-attention, aggregated vectors, and combinations of intermediate representations. We apply the proposed model to tackle the CodeSearchNet challenge on semantic code search. We align cross-lingual embeddings for multi-modality learning with large batches and hard example mining, and combine different learned representations to further enhance representation learning. Our model is trained on the CodeSearchNet corpus and evaluated on held-out data; the final model achieves 0.384 NDCG and won first place on this benchmark. Models and code are available at https://github.com/overwindows/SemanticCodeSearch.git.
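The bi-encoder scoring and in-batch training scheme described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: it assumes query and code embeddings have already been produced by some encoders, scores them with cosine similarity, and computes a contrastive loss in which every other code snippet in the batch serves as a negative (which is why large batches help: they supply more, and harder, negatives). The function names and the `scale` temperature are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(q, c):
    """Pairwise cosine similarity between query and code embeddings.

    q: (B, D) query embeddings; c: (B, D) code embeddings.
    Returns a (B, B) matrix whose entry [i, j] scores query i against code j.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T

def in_batch_softmax_loss(sim, scale=20.0):
    """Contrastive loss with in-batch negatives.

    The diagonal of `sim` holds the matched (query, code) pairs; every
    off-diagonal entry in a row acts as a negative for that query. The
    hardest negatives are simply the highest-scoring wrong columns.
    """
    logits = scale * sim
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # cross-entropy on the diagonal
```

At retrieval time, ranking reduces to taking the argmax (or top-k) of each query's row of the similarity matrix over the indexed code embeddings.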
