When can I Speak? Predicting initiation points for spoken dialogue agents

Paper title

When can I Speak? Predicting initiation points for spoken dialogue agents

Authors

Siyan Li, Ashwin Paranjape, Christopher D. Manning

Abstract

Current spoken dialogue systems initiate their turns after a long period of silence (700-1000ms), which leads to little real-time feedback, sluggish responses, and an overall stilted conversational flow. Humans typically respond within 200ms and successfully predicting initiation points in advance would allow spoken dialogue agents to do the same. In this work, we predict the lead-time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio and word features from a pre-trained language model (GPT-2) operating on incremental transcriptions. To evaluate errors, we propose two metrics w.r.t. predicted and true lead times. We train and evaluate the models on the Switchboard Corpus and find that our method outperforms features from prior work on both metrics and vastly outperforms the common approach of waiting for 700ms of silence.
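
For illustration only, the silence-based baseline mentioned in the abstract (waiting for 700ms of silence before the agent speaks) can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function name `silence_initiation_time`, the 10ms frame size, and the assumption that a voice activity detector has already labeled each frame as speech or non-speech are all hypothetical.

```python
def silence_initiation_time(frame_is_speech, frame_ms=10, silence_ms=700):
    """Return the time in ms at which an agent that waits for `silence_ms`
    of continuous silence would start speaking, or None if it never would.

    frame_is_speech: iterable of booleans, one per audio frame (assumed to
    come from some voice activity detector, which is not shown here).
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    run = 0                          # length of the current silent run
    seen_speech = False              # ignore silence before the user speaks
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            seen_speech = True
            run = 0
        elif seen_speech:
            run += 1
            if run >= needed:
                # Frame i ends at (i + 1) * frame_ms.
                return (i + 1) * frame_ms
    return None


# Hypothetical input: the user speaks for 500 ms, then stays silent for 1 s.
frames = [True] * 50 + [False] * 100
print(silence_initiation_time(frames))
```

With this input the baseline starts its turn 700ms after the user stops speaking, which is the delay the paper aims to shorten by predicting the initiation point in advance.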
