PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

Paper Title

PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

Authors

Lan, Tian; Mao, Xian-Ling; Wei, Wei; Gao, Xiaoyan; Huang, Heyan

Abstract

Open-domain generative dialogue systems have attracted considerable attention over the past few years, yet evaluating them automatically remains a major challenge. As far as we know, there are three kinds of automatic evaluation methods for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Because these have not been compared systematically, it is unclear which kind is most effective. In this paper, we first systematically measure all three kinds of automatic evaluation metrics under the same experimental setting to determine which performs best. Extensive experiments show that learning-based metrics are the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields an extremely imbalanced and low-quality dataset for training the score model. To address this issue, we propose PONE, a novel and feasible learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art learning-based evaluation methods, with an average correlation improvement of 13.18%. In addition, we have publicly released the code for our proposed method and the state-of-the-art baselines.
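
The abstract judges metrics by their correlation with human judgments. As a rough illustration of what such a comparison involves, the sketch below computes Pearson and Spearman correlations between human ratings and an automatic metric's scores. The abstract does not say which correlation coefficients are used, so both choices are assumptions, and the function names and sample scores are hypothetical and are not taken from the released code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical data: human ratings (1-5) and metric scores for 5 responses.
    human_scores = [1, 2, 3, 4, 5]
    metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
    print("Pearson: ", pearson(human_scores, metric_scores))
    print("Spearman:", spearman(human_scores, metric_scores))
```

Under this kind of setup, a metric is preferred when its scores track the human ratings more closely, that is, when the coefficients are closer to 1.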
