Paper Title

Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications

Paper Authors

Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, Alexandra Olteanu

Paper Abstract

There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult. Compounding this difficulty is the need to assess varying quality criteria depending on the deployment setting. While the landscape of NLG evaluation has been well-mapped, practitioners' goals, assumptions, and constraints -- which inform decisions about what, when, and how to evaluate -- are often partially or implicitly stated, or not stated at all. Combining a formative semi-structured interview study of NLG practitioners (N=18) with a survey study of a broader sample of practitioners (N=61), we surface goals, community practices, assumptions, and constraints that shape NLG evaluations, examining their implications and how they embody ethical considerations.
