Paper Title
End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
Paper Authors
Paper Abstract
In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way for humans to seek or test knowledge is through conversation. Therefore, we propose a new Spoken Conversational Question Answering (SCQA) task, aiming to enable systems to model complex dialogue flows given speech documents. In this task, our main objective is to build systems that handle conversational questions based on audio recordings, and to explore the plausibility of providing systems with additional cues from different modalities during information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple yet novel mechanism, termed Dual Attention, which encourages better alignment between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of existing state-of-the-art methods degrades significantly on our dataset, demonstrating the necessity of cross-modal information integration. Our experimental results show that our proposed method achieves superior performance on spoken conversational question answering tasks.
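To make the cross-modal alignment idea concrete, below is a minimal sketch of how a "dual attention" block between text and audio embedding sequences could be wired in PyTorch. This is an illustrative assumption, not the authors' implementation: the class name `DualAttention`, the embedding dimension, the number of heads, and the mean-pooling fusion step are all hypothetical choices.

```python
# Illustrative sketch: text attends to audio and audio attends to text,
# then the two views are fused into a single text-aligned representation.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text queries attend over audio keys/values, and vice versa.
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_len, dim); audio_emb: (batch, audio_len, dim)
        text_ctx, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)
        audio_ctx, _ = self.audio_to_text(audio_emb, text_emb, text_emb)
        # Summarize the audio view and broadcast it to the text length, then fuse.
        audio_summary = audio_ctx.mean(dim=1, keepdim=True).expand_as(text_ctx)
        return self.fuse(torch.cat([text_ctx, audio_summary], dim=-1))

# Example usage with random embeddings standing in for encoder outputs.
block = DualAttention()
text = torch.randn(2, 50, 768)    # e.g., transcript token embeddings
audio = torch.randn(2, 200, 768)  # e.g., acoustic frame embeddings
fused = block(text, audio)        # shape: (2, 50, 768)
```

In this sketch, the fused output stays aligned with the text sequence so it could feed a span-prediction head, while the audio-to-text branch gives the acoustic stream a text-conditioned view; how the actual DDNet model combines the two directions is described in the paper itself.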