Title
Sequence Model Design for Code Completion in the Modern IDE
Authors
Abstract
Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has been given to the practical constraints imposed by interactive developer tools. In particular, neural language models for source code, like the one described in "Maybe Deep Neural Networks are the Best Choice for Modeling Source Code", are framed around code completion but report only the accuracy of next-token prediction. However, for a language model (LM) to work well within a real-world code completion system, it must also always make suggestions that produce valid code that typechecks, to support code completion's role in correctness checking; return instantaneous results, to help programmers code more efficiently in fewer keystrokes; and be small enough to fit comfortably on disk and in memory on developer workstations, since virtually all modern IDEs run locally and support offline usage. To meet these additional requirements, we propose a novel design for predicting the top-k next tokens that combines static analysis's ability to enumerate all valid keywords and in-scope identifiers with a language model's ability to place a probability distribution over them. Our model mixes a character-level input representation with token-level output to represent out-of-vocabulary (OOV) tokens meaningfully and to minimize prediction latency. OOV tokens can be predicted by detecting the local repetition common in software. This design achieves state-of-the-art accuracy in source code modeling and fits the constraints imposed by real-world code completion implementations in modern IDEs.
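The core idea of combining static analysis with a language model can be illustrated with a minimal sketch. All names below (`topk_completions`, the example tokens) are hypothetical, and the LM is stood in for by a plain dictionary of log-scores; the point is only that the distribution is restricted to the statically-valid candidate set before renormalizing and ranking, so every suggestion corresponds to code that typechecks.

```python
import math

def topk_completions(lm_logits, valid_tokens, k=3):
    """Hypothetical sketch: restrict a language model's next-token
    distribution to the keywords and in-scope identifiers that a
    static analysis reports as valid at the cursor, then return the
    k most probable suggestions.

    lm_logits: dict mapping token -> unnormalized log-score from the LM.
    valid_tokens: set of tokens the static analysis deems valid here.
    """
    # Keep only candidates that both the LM and the analysis know about.
    scores = {t: lm_logits[t] for t in valid_tokens if t in lm_logits}
    # Softmax-normalize over the restricted candidate set, so the
    # probabilities sum to 1 over valid completions only.
    z = sum(math.exp(s) for s in scores.values())
    probs = {t: math.exp(s) / z for t, s in scores.items()}
    # Rank by probability and return the top k (token, probability) pairs.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example: the LM scores four tokens, but static analysis rules out
# "qux" (not in scope), so it can never be suggested.
lm_logits = {"foo": 2.0, "bar": 1.0, "if": 0.5, "qux": 3.0}
valid = {"foo", "bar", "if"}
suggestions = topk_completions(lm_logits, valid, k=2)
```

In the example, `qux` has the highest LM score but is filtered out by the analysis, so `foo` and `bar` are ranked first instead.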