Paper Title
Pre-training for Abstractive Document Summarization by Reinstating Source Text
Paper Authors
Paper Abstract
Abstractive document summarization is usually modeled as a sequence-to-sequence (Seq2Seq) learning problem. Unfortunately, training large Seq2Seq based summarization models on limited supervised summarization data is challenging. This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text. The main idea is that, given an input text artificially constructed from a document, a model is pre-trained to reinstate the original document. These objectives include sentence reordering, next sentence generation, and masked document generation, which have close relations with the abstractive document summarization task. Experiments on two benchmark summarization datasets (i.e., CNN/DailyMail and New York Times) show that all three objectives can improve performance upon baselines. Compared to models pre-trained on large-scale data (more than 160GB), our method, with only 19GB text for pre-training, achieves comparable results, which demonstrates its effectiveness.
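As a concrete illustration of how such self-supervised (input, target) pairs might be built from an unlabeled document, here is a minimal sketch in Python. The function names, the 50/50 split point for next sentence generation, and the 15% masking ratio are illustrative assumptions for exposition, not the authors' released implementation.

```python
import random

def sentence_reordering_pair(sentences):
    """Shuffle the document's sentences; the model is trained to reinstate the original order."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return " ".join(shuffled), " ".join(sentences)

def next_sentence_generation_pair(sentences):
    """Feed the first part of the document; the model generates the remaining part."""
    split = len(sentences) // 2  # split point is an illustrative choice
    return " ".join(sentences[:split]), " ".join(sentences[split:])

def masked_document_generation_pair(sentences, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask a fraction of tokens; the model is trained to reinstate the full original document."""
    tokens = " ".join(sentences).split()
    masked = [mask_token if random.random() < mask_ratio else t for t in tokens]
    return " ".join(masked), " ".join(sentences)

if __name__ == "__main__":
    doc = ["The cat sat on the mat.", "It was a sunny day.", "The dog barked outside."]
    for make_pair in (sentence_reordering_pair,
                      next_sentence_generation_pair,
                      masked_document_generation_pair):
        src, tgt = make_pair(doc)
        print(f"{make_pair.__name__}:\n  input : {src}\n  target: {tgt}")
```

In each case the target side is (part of) the original document, which mirrors the summarization setting where a Seq2Seq decoder must produce fluent text conditioned on a source document.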