Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

Title

Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

Authors

Xiaosu Tong, Che-Wei Huang, Sri Harish Mallidi, Shaun Joseph, Sonal Pareek, Chander Chandak, Ariya Rastrow, Roland Maas

Abstract

In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. The streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer to form the final utterance-level prediction up to the current frame. In order to avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non device-directed task show that the proposed model yields an equal error rate reduction of 41% compared to our previous best model on this task. Furthermore, we show that the proposed model is able to accurately predict earlier in time compared to the attention-based models.
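The two streaming ingredients named in the abstract, a cache for each convolution and a causal mean over frame-level outputs, can be illustrated with a toy sketch. This is not the authors' implementation: the class names `CachedCausalConv1D` and `CausalMean`, the scalar per-frame features, the zero left-padding, and the omission of the residual connections and LSTM layers are all assumptions made to keep the example short.

```python
from typing import List, Sequence


class CachedCausalConv1D:
    """Toy causal 1-D convolution over scalar frames (hypothetical helper).

    The last (k - 1) inputs are cached, so each new frame costs one
    dot product instead of recomputing the whole sequence.
    """

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # Assumption: frames before the utterance start are zeros.
        self.cache: List[float] = [0.0] * (len(self.kernel) - 1)

    def step(self, x: float) -> float:
        window = self.cache + [x]          # k most recent inputs
        y = sum(w * v for w, v in zip(self.kernel, window))
        self.cache = window[1:]            # drop the oldest input
        return y


class CausalMean:
    """Running mean of all frame-level outputs up to the current frame."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, frame_output: float) -> float:
        self.total += frame_output
        self.count += 1
        return self.total / self.count


def offline_causal_conv(xs: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Reference: the same convolution computed over the full sequence."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [
        sum(w * v for w, v in zip(kernel, padded[t:t + k]))
        for t in range(len(xs))
    ]


frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]

conv = CachedCausalConv1D(kernel)
agg = CausalMean()

streamed = [conv.step(x) for x in frames]      # one frame at a time
running = [agg.update(y) for y in streamed]    # prediction after each frame

print(streamed)  # [0.5, 1.5, 2.5, 3.5]
print(running)   # [0.5, 1.0, 1.5, 2.0]
```

Because the running mean depends only on frames already seen, a prediction is available after every frame, and the cached convolution produces the same outputs as the offline computation without reprocessing earlier frames.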
