Paper title
Progressive Voice Trigger Detection: Accuracy vs Latency
Paper authors
Paper abstract
We present an architecture for voice trigger detection for virtual assistants. The main idea in this work is to exploit information in the words that immediately follow the trigger phrase. We first demonstrate that including more audio context after a detected trigger phrase does yield a more accurate decision. However, waiting to listen to more audio every time increases latency. Progressive Voice Trigger Detection lets us trade off latency against accuracy by accepting clear trigger candidates quickly, while waiting for more context before deciding whether to accept more marginal examples. Using a two-stage architecture, we show that by delaying the decision for just 3% of detected true triggers in the test set, we obtain a relative improvement of 66% in false rejection rate, while incurring only a negligible increase in latency.
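The progressive decision rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the score scale, all three threshold values, and the immediate rejection of low-scoring candidates are assumptions made for illustration. The abstract only states that clear candidates are accepted quickly and marginal ones wait for more audio context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    accepted: bool
    delayed: bool  # True when the follow-up audio context had to be consulted


def progressive_decision(
    first_score: float,
    rescore_with_context: Callable[[], float],
    accept_threshold: float = 0.9,   # hypothetical value
    reject_threshold: float = 0.3,   # hypothetical value
    context_threshold: float = 0.5,  # hypothetical value
) -> Decision:
    """Decide on a trigger candidate, delaying only marginal cases.

    first_score: detector score computed on the trigger phrase alone.
    rescore_with_context: callable that waits for the words following the
        trigger phrase and returns a score computed with that extra context.
        It is invoked only for marginal candidates, so only those pay the
        additional latency.
    """
    # Clear candidate: accept immediately, no extra latency.
    if first_score >= accept_threshold:
        return Decision(accepted=True, delayed=False)

    # Assumption: very low scores are rejected without waiting.
    if first_score < reject_threshold:
        return Decision(accepted=False, delayed=False)

    # Marginal candidate: wait for more audio, then decide on the new score.
    second_score = rescore_with_context()
    return Decision(accepted=second_score >= context_threshold, delayed=True)
```

Passing the second-stage scorer as a callable keeps the latency cost lazy: the extra audio is only awaited when the first score falls between the two thresholds, which mirrors the idea of delaying the decision for only a small fraction of candidates.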