Paper Title
Dialog Inpainting: Turning Documents into Dialogs
Paper Authors
Paper Abstract
Many important questions (e.g. "How do I eat healthier?") require conversation to establish context and explore in depth. However, conversational question answering (ConvQA) systems have long been stymied by scarce training data that is expensive to collect. To address this problem, we propose a new technique for synthetically generating diverse and high-quality dialog data: dialog inpainting. Our approach takes the text of any document and transforms it into a two-person dialog between the writer and an imagined reader: we treat sentences from the article as utterances spoken by the writer, and then use a dialog inpainter to predict what the imagined reader asked or said in between each of the writer's utterances. By applying this approach to passages from Wikipedia and the web, we produce WikiDialog and WebDialog, two datasets totalling 19 million diverse information-seeking dialogs -- 1,000x larger than the largest existing ConvQA dataset. Furthermore, human raters judge the answer adequacy and conversationality of WikiDialog to be as good as or better than existing manually-collected datasets. Using our inpainted data to pre-train ConvQA retrieval systems, we significantly advance the state of the art across three benchmarks (QReCC, OR-QuAC, TREC CAsT), yielding up to 40% relative gains on standard evaluation metrics.
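The document-to-dialog transformation described in the abstract can be sketched in a few lines: split a passage into sentences, treat each sentence as a writer turn, and ask an inpainter model to fill in the imagined reader's turn before each one. The sketch below is illustrative only; `inpaint_dialog` and the `toy_inpainter` stand-in are hypothetical names (the paper's actual inpainter is a trained neural model, not shown here).

```python
import re

def inpaint_dialog(passage, inpainter):
    """Turn a document passage into a two-person dialog.

    A sketch of dialog inpainting: the passage's sentences become the
    writer's turns verbatim, and `inpainter` (a stand-in for a trained
    model) predicts the imagined reader's turn before each one.
    """
    # Naive sentence split on end-of-sentence punctuation.
    writer_turns = re.split(r"(?<=[.!?])\s+", passage.strip())
    dialog = []
    for sentence in writer_turns:
        # Predict the reader's utterance, conditioning on the dialog so
        # far and on the writer's upcoming sentence (its "answer").
        reader_turn = inpainter(dialog, next_writer_turn=sentence)
        dialog.append(("reader", reader_turn))
        dialog.append(("writer", sentence))
    return dialog

# Toy stand-in for the trained inpainter: emits generic prompts instead
# of model predictions, just to show the data flow.
def toy_inpainter(history, next_writer_turn):
    return "Can you tell me more?" if history else "What is this about?"

passage = ("Dialog inpainting transforms documents into dialogs. "
           "It predicts the reader's turns between the writer's sentences.")
dialog = inpaint_dialog(passage, toy_inpainter)
for speaker, turn in dialog:
    print(f"{speaker}: {turn}")
```

Running this on a two-sentence passage yields a four-turn dialog alternating reader and writer; swapping `toy_inpainter` for a real generative model is where the method's quality comes from.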