Paper Title
Generating Representative Headlines for News Stories
Paper Authors
Paper Abstract
Millions of news articles are published online every day, which can be overwhelming for readers to follow. Grouping articles that report the same event into news stories is a common way of assisting readers in their news consumption. However, efficiently and effectively generating a representative headline for each story remains a challenging research problem. Automatic summarization of document sets has been studied for decades, but few studies have focused on generating representative headlines for a set of articles. Unlike summaries, which aim to capture the most information with the least redundancy, headlines aim to capture, at short length, the information jointly shared by the story's articles, and to exclude information that is too specific to any individual article. In this work, we study the problem of generating representative headlines for news stories. We develop a distant supervision approach to train large-scale generation models without any human annotation. This approach centers on two technical components. First, we propose a multi-level pre-training framework that incorporates massive unlabeled corpora with different quality-vs.-quantity balances at different levels. We show that models trained within this framework outperform those trained on a purely human-curated corpus. Second, we propose a novel self-voting-based article attention layer that extracts salient information shared by multiple articles. We show that models incorporating this layer are robust to potential noise in news stories and outperform existing baselines with or without noise. Our model can be further enhanced by incorporating human labels, and we show that our distant supervision approach significantly reduces the demand for labeled data.
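The abstract does not specify how the self-voting article attention layer is computed, but the core idea it describes, letting articles in a story vote for one another so that content shared by many articles is emphasized and outlier (noisy) articles are down-weighted, can be sketched as follows. This is a minimal illustrative sketch over fixed article embeddings, not the paper's actual layer; the function name, the use of cosine similarity as the voting signal, and the softmax pooling are all assumptions for illustration.

```python
import numpy as np

def self_voting_attention(article_embs):
    """Illustrative sketch: weight a story's articles by peer votes.

    Each article "votes" for every other article in proportion to their
    cosine similarity. Articles that share content with many peers
    accumulate high vote totals; outliers (potential noise in the story
    cluster) receive low totals and are down-weighted in the pooled
    story representation.
    """
    E = np.asarray(article_embs, dtype=float)       # (n_articles, dim)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    U = E / np.clip(norms, 1e-8, None)              # unit-normalized rows
    sim = U @ U.T                                   # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                      # an article cannot vote for itself
    votes = sim.sum(axis=1)                         # total votes received per article
    weights = np.exp(votes - votes.max())           # softmax over articles
    weights /= weights.sum()
    story_repr = weights @ E                        # attention-pooled story vector
    return weights, story_repr
```

For example, given three mutually similar article embeddings and one outlier, the outlier receives the smallest attention weight, which is the robustness-to-noise behavior the abstract attributes to the real layer.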