Paper Title

The Authenticity Gap in Human Evaluation

Paper Authors

Kawin Ethayarajh, Dan Jurafsky

Paper Abstract

Human ratings are the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. Analyzing this standard protocol through the lens of utility theory in economics, we identify the implicit assumptions it makes about annotators. These assumptions are often violated in practice, in which case annotator ratings cease to reflect their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new human evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). When human evaluation of stories is done with SPA, we can recover the ordering of GPT-3 models by size, with statistically significant results. However, when human evaluation is done with the standard protocol, less than half of the expected preferences can be recovered (e.g., there is no significant difference between $\texttt{curie}$ and $\texttt{davinci}$, despite using a highly powered test).
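For concreteness, the Python sketch below illustrates the "standard protocol" the abstract critiques: per-item Likert ratings are averaged across annotators, and NLG systems are ranked by their mean score. The system names and ratings are hypothetical placeholders, not data from the paper; the point is only to show the averaging-and-ranking step whose implicit assumptions the authors analyze.

```python
# Minimal sketch of the standard human-evaluation protocol:
# collect Likert ratings per system, average across annotators,
# and rank systems by their mean score.
# All ratings and system names below are hypothetical.
from statistics import mean

# ratings[system] is a list of Likert-scale scores (1-5), one per annotator
ratings = {
    "system_A": [4, 5, 3, 4],
    "system_B": [3, 4, 4, 3],
    "system_C": [5, 4, 5, 4],
}

# Average across annotators for each system.
mean_scores = {system: mean(scores) for system, scores in ratings.items()}

# Rank systems by average score, highest first.
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)

for system in ranking:
    print(f"{system}: {mean_scores[system]:.2f}")
```

The paper's argument is that this ranking only reflects true annotator preferences under assumptions (e.g., about how Likert scores map to utility) that are often violated in practice, which motivates the proposed SPA protocol for open-ended tasks.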
