CV4Code: Sourcecode Understanding via Visual Code Representations

Paper title

CV4Code: Sourcecode Understanding via Visual Code Representations

Authors

Ruibo Shi, Lili Tao, Rohan Saphal, Fran Silavong, Sean J. Moran

Abstract

We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.
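To make the ASCII codepoint-based representation concrete, the following is a minimal sketch of how a snippet might be turned into a fixed-size two-dimensional grid, with one codepoint per cell instead of rendered RGB pixels. The abstract does not specify the image size, padding value, tab handling, or treatment of non-ASCII characters, so the function name `code_to_image` and all of its parameters are assumptions made for illustration, not the authors' implementation.

```python
def code_to_image(snippet: str, height: int = 8, width: int = 16, pad: int = 0) -> list[list[int]]:
    """Map a code snippet to a height x width grid of ASCII codepoints.

    Assumptions (not taken from the paper): tabs expand to 4 spaces,
    long lines and extra lines are cropped, short ones are padded with
    `pad`, and non-ASCII characters are replaced by `pad`.
    """
    rows = []
    for line in snippet.expandtabs(4).splitlines()[:height]:
        # One integer per character: the row/column position keeps the layout.
        codes = [ord(ch) if ord(ch) < 128 else pad for ch in line[:width]]
        codes += [pad] * (width - len(codes))  # right-pad short lines
        rows.append(codes)
    # Pad missing rows so every snippet yields the same shape.
    rows += [[pad] * width for _ in range(height - len(rows))]
    return rows


# The snippet is deliberately incomplete: no tokeniser or parser is involved,
# so syntactically incorrect code is still mapped to a grid.
snippet = "def f(x):\n    return x +"
image = code_to_image(snippet, height=4, width=12)
```

Because the grid is built character by character, indentation and line breaks survive as spatial positions, which is the structural information the abstract says a convolutional or Transformer network can then consume.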
