Paper Title
Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses
Paper Authors
Paper Abstract
This paper presents our modeling and architecture approaches for building a highly accurate, low-latency language identification system to support multilingual spoken queries for voice assistants. A common approach to multilingual speech recognition is to run multiple monolingual ASR systems in parallel and rely on a language identification (LID) component to detect the input language. Conventionally, LID relies on acoustic-only information to detect the input language. We propose an approach that learns and combines acoustic-level representations with embeddings estimated on ASR hypotheses, resulting in up to a 50% relative reduction in identification error rate compared to a model that uses acoustic-only features. Furthermore, to reduce processing cost and latency, we exploit a streaming architecture that identifies the spoken language early, as soon as the system reaches a predetermined confidence level, alleviating the need to run multiple ASR systems until the end of the input query. The combined acoustic and text LID, coupled with our proposed streaming runtime architecture, results in identification that is on average 1500 ms earlier for more than 50% of utterances, with almost no degradation in accuracy. We also show improved results by adopting a semi-supervised learning (SSL) technique that uses the newly proposed model architecture as a teacher model.
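The abstract describes two mechanisms: fusing an acoustic-level representation with an embedding of the ASR hypothesis, and a streaming runtime that commits to a language once a predetermined confidence level is reached. The sketch below illustrates a plausible late-fusion LID classifier in PyTorch; the module names, dimensions, concatenation-based fusion, and two-language setup are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a combined acoustic + text LID classifier.
# Layer sizes and module names are illustrative assumptions only.
import torch
import torch.nn as nn


class CombinedLID(nn.Module):
    """Fuses a pooled acoustic-level representation with an embedding of the
    ASR hypothesis text and predicts a language posterior."""

    def __init__(self, acoustic_dim=256, text_dim=128, hidden_dim=256, num_languages=2):
        super().__init__()
        # Projects the pooled acoustic encoder output.
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        # Projects the pooled embedding of the ASR hypothesis tokens.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Classifier over the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_languages),
        )

    def forward(self, acoustic_repr, text_embedding):
        # acoustic_repr: (batch, acoustic_dim), text_embedding: (batch, text_dim)
        fused = torch.cat(
            [self.acoustic_proj(acoustic_repr), self.text_proj(text_embedding)],
            dim=-1,
        )
        return torch.log_softmax(self.classifier(fused), dim=-1)
```

The streaming early-identification behavior can likewise be illustrated as a simple confidence-threshold loop over per-chunk LID posteriors; the chunked input, the 0.95 threshold, and the function name are assumptions for illustration only.

```python
# Hypothetical illustration of the streaming early-decision logic: once the
# LID posterior for some language exceeds a predetermined confidence
# threshold, the monolingual ASR decoders for the other languages can be
# stopped early instead of running until the end of the query.

def streaming_lid_decision(posterior_stream, confidence_threshold=0.95):
    """Consume per-chunk posteriors (dicts of language -> probability) and
    return (language, chunk_index) as soon as the threshold is reached, or
    the final best guess if it never is."""
    last_posteriors = None
    chunk_index = -1
    for chunk_index, posteriors in enumerate(posterior_stream):
        last_posteriors = posteriors
        best_language = max(posteriors, key=posteriors.get)
        if posteriors[best_language] >= confidence_threshold:
            # Early identification: downstream ASR systems for the other
            # languages can be terminated at this point.
            return best_language, chunk_index
    if last_posteriors is None:
        raise ValueError("posterior_stream must yield at least one chunk")
    # End of utterance reached without an early decision.
    best_language = max(last_posteriors, key=last_posteriors.get)
    return best_language, chunk_index
```

In such a setup the threshold trades latency against accuracy: a lower threshold triggers earlier decisions but risks committing to the wrong language before enough evidence has accumulated.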