Paper Title

Out of the BLEU: how should we assess quality of the Code Generation models?

Paper Authors

Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, Timofey Bryksin

Paper Abstract

In recent years, researchers have created and introduced a significant number of various code generation models. As human evaluation of every new model version is unfeasible, the community adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain and it is unclear whether they are applicable for the code generation tasks and how well they agree with the human evaluation on this task. There are also other metrics, CodeBLEU and RUBY, developed to estimate the similarity of code, that take into account the properties of source code. However, for these metrics there are hardly any studies on their agreement with the human evaluation. Despite all that, minimal differences in the metric scores have been used in recent papers to claim superiority of some code generation models over the others. In this paper, we present a study on the applicability of six metrics -- BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY -- for evaluation of code generation models. We conduct a study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Yet, finding a metric for code generation that closely agrees with humans requires additional work.
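
For readers unfamiliar with how such surface-similarity metrics are computed in practice, the sketch below scores a generated Python one-liner against a reference snippet with BLEU and ChrF. This is only an illustration of the general technique, not the evaluation pipeline used in the paper: the sacrebleu package and the example snippets are our own assumptions.

# Illustrative sketch only; assumes the third-party `sacrebleu` package
# (pip install sacrebleu), a common implementation of BLEU and ChrF.
from sacrebleu.metrics import BLEU, CHRF

# Hypothetical model output and reference snippet (made-up examples).
hypotheses = ["df = pd.read_csv('data.csv', sep=';')"]
references = [["df = pd.read_csv('data.csv', delimiter=';')"]]  # one reference stream

bleu = BLEU()   # n-gram precision with brevity penalty
chrf = CHRF()   # character n-gram F-score

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))

Note that, per the paper's findings, small gaps between such scores (under roughly 5 points on CoNaLa or 2 points on HearthStone) are not reliable evidence that one model generates better code than another.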
