Paper Title
Subword Dictionary Learning and Segmentation Techniques for Automatic Speech Recognition in Tamil and Kannada
Paper Authors
Paper Abstract
We present automatic speech recognition (ASR) systems for Tamil and Kannada based on subword modeling, to effectively handle the unlimited vocabulary arising from the highly agglutinative nature of these languages. We explore byte pair encoding (BPE), propose a variant of this algorithm named extended-BPE, and use the Morfessor tool to segment each word into subwords. We effectively incorporate maximum likelihood (ML) and Viterbi estimation techniques, within a weighted finite-state transducer (WFST) framework, into these algorithms to learn the subword dictionary from a large text corpus. Using the learnt subword dictionary, the words in the training-data transcriptions are segmented into subwords, and we train deep neural network ASR systems that recognize a subword sequence for any given test speech utterance. The output subword sequence is then post-processed using deterministic rules to obtain the final word sequence, so that the number of words that can actually be recognized is much larger. For Tamil ASR, we use 152 hours of data for training and 65 hours for testing, whereas for Kannada ASR, we use 275 hours for training and 72 hours for testing. Experimenting with different combinations of segmentation and estimation techniques, we find that the word error rate (WER) drops substantially compared to the baseline word-level ASR, achieving maximum absolute WER reductions of 6.24% for Tamil and 6.63% for Kannada.
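To make the pipeline concrete, the sketch below illustrates plain BPE subword-dictionary learning, word segmentation, and the deterministic post-processing that merges recognized subwords back into words. It is a minimal illustration, not the paper's implementation: the toy corpus, the number of merges, and the "@@" continuation marker are assumptions, and the extended-BPE variant, Morfessor, and the ML/Viterbi estimation within the WFST framework are not reproduced here.

```python
"""Minimal sketch: BPE subword learning, segmentation, and word
reconstruction. Illustrative only; not the paper's extended-BPE or
WFST-based estimation."""
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn a list of BPE merge operations from a word-frequency dict."""
    # Represent each word as a tuple of symbols (initially characters).
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word into subwords by replaying the learnt merges;
    non-final subwords get a '@@' continuation marker (an assumption)."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


def join_subwords(subword_seq):
    """Deterministic post-processing: merge marked subwords into words."""
    return " ".join(subword_seq).replace("@@ ", "")


if __name__ == "__main__":
    # Toy English corpus standing in for a large Tamil/Kannada text corpus.
    corpus = Counter("low lower lowest newer newest wider".split())
    merges = learn_bpe(corpus, num_merges=10)
    subwords = segment("lowest", merges)
    print(subwords)               # e.g. ['low@@', 'est'], merge-dependent
    print(join_subwords(subwords))  # 'lowest'
```

Because the continuation marker makes the merge rule unambiguous, any recognized subword sequence maps back to exactly one word sequence, which is why the effective recognizable vocabulary is much larger than the subword dictionary itself.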