Paper Title
TRUE: Re-evaluating Factual Consistency Evaluation
Paper Authors
Paper Abstract
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silos for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, leaving the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.
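To make the example-level meta-evaluation idea in the abstract concrete, below is a minimal sketch of how such a protocol can be instantiated: every metric is treated as a function from a (grounding, generated text) pair to a real-valued consistency score, and it is scored by ROC AUC against binarized human consistency labels. This is an illustration only; the `meta_evaluate` helper, the toy lexical-overlap metric, and the two hand-written examples are assumptions of this sketch and are not taken from the paper.

```python
# Minimal sketch of an example-level meta-evaluation protocol (assumption:
# metrics are scored by ROC AUC against binary human consistency labels).
from typing import Callable, List, Tuple
from sklearn.metrics import roc_auc_score  # standard ROC AUC implementation


def meta_evaluate(metric: Callable[[str, str], float],
                  examples: List[Tuple[str, str, int]]) -> float:
    """Score each (grounding, generation, label) example with `metric` and
    report ROC AUC against the binary human labels (1 = consistent)."""
    labels = [label for _, _, label in examples]
    scores = [metric(grounding, generation) for grounding, generation, _ in examples]
    return roc_auc_score(labels, scores)


def toy_overlap_metric(grounding: str, generation: str) -> float:
    # Hypothetical stand-in for an NLI- or QA-based scorer: fraction of
    # generated tokens that also appear in the grounding text.
    gen_tokens = set(generation.lower().split())
    src_tokens = set(grounding.lower().split())
    return len(gen_tokens & src_tokens) / max(len(gen_tokens), 1)


# Illustrative usage on two invented examples.
examples = [
    ("The meeting was moved to Friday.", "The meeting is on Friday.", 1),
    ("The meeting was moved to Friday.", "The meeting was cancelled.", 0),
]
print(meta_evaluate(toy_overlap_metric, examples))
```

In practice the toy metric above would be replaced by a real scorer (e.g. the entailment probability from an NLI model, or a QA-based agreement score), while the surrounding protocol stays the same, which is what makes the ROC-AUC-style comparison actionable at the level of individual examples.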