Paper Title
Turn-Taking Prediction for Natural Conversational Speech
Paper Authors
Paper Abstract
While streaming voice assistant systems are used in many applications, they typically focus on unnatural, one-shot interactions, assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, word lengthening, filled pauses, and repeated phrases. This makes speech recognition on conversational speech, including speech with multiple queries, a challenging task. To better model conversational interaction, it is critical to discriminate between disfluencies and end of query, so that the user can hold the floor during disfluencies while the system responds as quickly as possible once the user has finished speaking. In this paper, we present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer. Our best system is obtained by jointly optimizing for the ASR task and for detecting when the user has paused to think or has finished speaking. The proposed approach demonstrates over 97% recall and 85% precision in predicting true turn-taking with only 100 ms latency on a test set designed with 4 types of disfluencies inserted in conversational utterances.