Paper Title

A Human Evaluation of AMR-to-English Generation Systems

Authors

Emma Manning, Shira Wein, Nathan Schneider

Abstract

Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. We also analyze common errors made by these systems.
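For context on the abstract's mention of BLEU, the sketch below shows how a corpus-level BLEU score is typically computed with the sacrebleu library. This is illustrative only: the hypothesis and reference sentences are hypothetical and are not drawn from the paper's data or the systems it evaluates.

```python
# Illustrative only: computing corpus-level BLEU with sacrebleu,
# the kind of automatic metric the paper argues is insufficient
# on its own for evaluating AMR-to-English generation.
import sacrebleu

# Hypothetical system outputs (e.g., sentences generated from AMR graphs).
hypotheses = [
    "The boy wants the girl to believe him.",
    "He did not go to school yesterday.",
]

# One list per reference set; sacrebleu supports multiple references.
references = [[
    "The boy wants the girl to believe him.",
    "He didn't go to school yesterday.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```

Note that BLEU rewards n-gram overlap with the references, so a fluent but differently worded output can score poorly, which is one reason human fluency and adequacy judgments, as collected in this paper, can give a more nuanced picture.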
