Paper Title
Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection
Paper Authors
Paper Abstract
In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. Streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer, which forms the utterance-level prediction from the frames seen up to the current frame. To avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non-device-directed task show that the proposed model yields an equal error rate (EER) reduction of 41% compared to our previous best model on this task. Furthermore, we show that the proposed model is able to make accurate predictions earlier in time than attention-based models.
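The two streaming ingredients named in the abstract, causal mean aggregation and per-convolution caching, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class and function names (causal_mean, StreamingMean, CachedCausalConv1d) are hypothetical, the convolution is reduced to a single-channel 1-D filter over time for readability, and zero left-padding is an assumption.

```python
import numpy as np


def causal_mean(frames):
    """Offline reference: row t is the mean of frames 0..t (no future frames)."""
    frames = np.asarray(frames, dtype=float)        # shape (T, D)
    counts = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / counts


class StreamingMean:
    """Same causal mean, computed one frame at a time with O(D) state."""

    def __init__(self, dim):
        self.total = np.zeros(dim)
        self.count = 0

    def update(self, frame):
        self.total += frame
        self.count += 1
        return self.total / self.count


class CachedCausalConv1d:
    """Single-channel causal convolution that caches the last K-1 inputs.

    Simplifying assumption: one scalar input per time step. Each step reuses
    the cached samples instead of recomputing over the whole history.
    """

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)   # shape (K,)
        # A zero-filled cache acts as left padding before the first frame.
        self.cache = np.zeros(len(self.kernel) - 1)

    def step(self, x):
        window = np.append(self.cache, x)               # K most recent samples
        self.cache = window[1:]                         # drop the oldest sample
        return float(window @ self.kernel)


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0, 4.0])

    agg = StreamingMean(dim=1)
    print([float(agg.update(np.array([v]))[0]) for v in x])

    conv = CachedCausalConv1d([0.5, 0.25, 0.25])
    print([conv.step(v) for v in x])
```

In this sketch the streaming and offline paths produce the same values, which is the property that lets an utterance-level score be read out at any frame without reprocessing earlier audio.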