Paper title
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Paper authors
Paper abstract
Recently, there has been growing interest in designing text generation systems from a discourse coherence perspective, e.g., by modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics are weak at recognizing coherence and are thus unreliable for spotting discourse-level improvements in those systems. In this work, we introduce DiscoScore, a parametrized discourse metric that uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human-rated coherence than early discourse metrics invented a decade ago; and (ii) the recent state-of-the-art BARTScore is weak when operated at the system level, which is particularly problematic because systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, to better understand DiscoScore, we justify the importance of discourse coherence for evaluation metrics and explain the superiority of one variant over another. Our code is available at https://github.com/AIPHES/DiscoScore.
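
The abstract distinguishes metrics "operated at system level" from per-output scoring. As a rough illustration of what a system-level correlation could look like, the sketch below averages metric scores and human ratings per system and then correlates the two lists of averages. This is not taken from the DiscoScore code: the function names, the choice of Pearson correlation, and the toy numbers are all assumptions made for illustration only.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level_correlation(records):
    """Correlate per-system averages of metric scores and human ratings.

    `records` is an iterable of (system_id, metric_score, human_rating)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    metric_by_system = defaultdict(list)
    human_by_system = defaultdict(list)
    for system_id, metric_score, human_rating in records:
        metric_by_system[system_id].append(metric_score)
        human_by_system[system_id].append(human_rating)

    systems = sorted(metric_by_system)
    metric_means = [sum(metric_by_system[s]) / len(metric_by_system[s]) for s in systems]
    human_means = [sum(human_by_system[s]) / len(human_by_system[s]) for s in systems]
    return pearson(metric_means, human_means)


# Invented toy data: three systems, two scored outputs each.
toy_records = [
    ("sys_a", 0.1, 1.0), ("sys_a", 0.3, 3.0),
    ("sys_b", 0.4, 2.0), ("sys_b", 0.6, 4.0),
    ("sys_c", 0.7, 4.0), ("sys_c", 0.9, 4.0),
]
r_system = system_level_correlation(toy_records)
print(f"system-level Pearson r = {r_system:.3f}")
```

Because scores are averaged before correlating, a metric can rank systems well even when its per-output scores are noisy, and the reverse can also hold, which is why the two settings are reported separately.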