Paper Title

Word-Level Representation From Bytes For Language Modeling

Paper Authors

Chu-Tak Lee, Qipeng Guo, Xipeng Qiu

Paper Abstract

Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages, such as a lack of robustness to noise and difficulty generalizing to new languages. Moreover, the current trend of scaling up models reveals that larger models require larger embeddings, which makes parallelization hard. Previous work on image classification shows that splitting raw input into a sequence of chunks is a strong, model-agnostic inductive bias. Based on this observation, we rethink the existing character-aware method that takes character-level inputs but performs word-level sequence modeling and prediction. We overhaul this method by introducing a cross-attention network that builds word-level representations directly from bytes, and a sub-word level prediction based on word-level hidden states to avoid the time and space requirements of word-level prediction. With these two improvements combined, we have a token-free model with slim input embeddings for downstream tasks. We name our method Byte2Word and evaluate it on language modeling and text classification. Experiments show that Byte2Word is on par with the strong sub-word baseline BERT while using only 10% of its embedding size. We further test our method on synthetic noise and cross-lingual transfer and find it competitive with baseline methods in both settings.
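
To make the core mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract describes: cross-attention that pools the byte embeddings of each word into a single word-level vector. The class name `Byte2WordSketch`, the single learned query vector, and all dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (not the authors' exact model): a learned query cross-attends
# over a word's byte embeddings to produce one word-level representation.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

class Byte2WordSketch(nn.Module):
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)            # one entry per byte value
        self.word_query = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, byte_ids):
        # byte_ids: (num_words, max_bytes_per_word), 0 used as padding here (assumption)
        kv = self.byte_embed(byte_ids)                          # (W, B, d) byte embeddings
        q = self.word_query.expand(byte_ids.size(0), -1, -1)    # (W, 1, d) one query per word
        word_vec, _ = self.attn(q, kv, kv,
                                key_padding_mask=byte_ids.eq(0))  # query attends over bytes
        return word_vec.squeeze(1)                              # (W, d) word-level reps

# Usage: pool the UTF-8 bytes of each word into a single word embedding.
words = ["byte", "level"]
ids = pad_sequence([torch.tensor(list(w.encode("utf-8"))) for w in words],
                   batch_first=True)
print(Byte2WordSketch()(ids).shape)  # torch.Size([2, 768])
```

Because the model only stores a 256-entry byte table plus the attention weights, the input embedding stays small regardless of vocabulary, which is consistent with the abstract's claim of slim input embeddings.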
