Streaming Intended Query Detection using E2E Modeling for Continued Conversation

Paper Title

Streaming Intended Query Detection using E2E Modeling for Continued Conversation

Authors

Shuo-yiin Chang, Guru Prakash, Zelin Wu, Qiao Liang, Tara N. Sainath, Bo Li, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Abstract

In voice-enabled applications, a predetermined hotword is usually used to activate a device in order to attend to the query. However, speaking queries followed by a hotword each time introduces a cognitive burden in continued conversations. To avoid repeating a hotword, we propose a streaming end-to-end (E2E) intended query detector that identifies the utterances directed towards the device and filters out other utterances not directed towards the device. The proposed approach incorporates the intended query detector into the E2E model that already folds different components of the speech recognition pipeline into one neural network. The E2E modeling on speech decoding and intended query detection also allows us to declare a quick intended query detection based on early partial recognition results, which is important to decrease latency and make the system responsive. We demonstrate that the proposed E2E approach yields a 22% relative improvement on equal error rate (EER) for the detection accuracy and a 600 ms latency improvement compared with an independent intended query detector. In our experiment, the proposed model detects whether the user is talking to the device with an 8.7% EER within 1.4 seconds of median latency after the user starts speaking.
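
The abstract reports detection accuracy as equal error rate (EER), the operating point at which the false accept rate and the false reject rate are equal. The sketch below is only an illustration of how such a figure is commonly computed from detector scores, not the authors' evaluation code. The function name `equal_error_rate`, the score convention (higher means more likely directed at the device), and the threshold sweep over observed scores are all assumptions made for this example.

```python
from typing import List, Tuple


def equal_error_rate(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a binary detector.

    Assumptions (not from the paper):
      scores: higher means "more likely directed at the device".
      labels: 1 for device-directed utterances, 0 for everything else.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = 0.0
    # Try every observed score as the accept threshold (accept if score >= t).
    for t in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = false_accepts / negatives  # false accept rate
        frr = false_rejects / positives  # false reject rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = t
    return best_eer, best_threshold
```

With a finite test set the two error rates rarely coincide exactly, so this sketch reports their average at the threshold where they are closest. A "relative improvement" on EER, as quoted in the abstract, is measured as a fraction of the baseline EER rather than in absolute percentage points.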
