Paper Title

MaxMatch-Dropout: Subword Regularization for WordPiece

Author

Hiraoka, Tatsuya

Abstract

We present a subword regularization method for WordPiece, which tokenizes text with a maximum matching algorithm. The proposed method, MaxMatch-Dropout, randomly drops words during the search performed by the maximum matching algorithm. It enables fine-tuning with subword regularization for popular pretrained language models such as BERT-base. Experimental results show that MaxMatch-Dropout improves performance on text classification and machine translation tasks, as other subword regularization methods do. Moreover, we provide a comparative analysis of three subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
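
To make the idea in the abstract concrete, here is a minimal Python sketch of greedy maximum matching with random dropping of matched words. It is an illustration under stated assumptions, not the author's implementation: the function name maxmatch_dropout, the parameter p, the toy vocabulary, and the rule that single-character pieces are never dropped are all hypothetical choices made here. The "##" continuation prefix follows the usual WordPiece convention.

```python
import random


def maxmatch_dropout(word, vocab, p=0.1, rng=None, unk="[UNK]"):
    """Tokenize one word by greedy longest-match, skipping each matched
    vocabulary entry with probability p (hypothetical sketch)."""
    rng = rng or random
    tokens = []
    start = 0
    while start < len(word):
        # WordPiece convention: non-initial pieces carry a "##" prefix.
        prefix = "##" if start > 0 else ""
        match = None
        # Try the longest candidate first, as in greedy maximum matching.
        for end in range(len(word), start, -1):
            piece = prefix + word[start:end]
            if piece not in vocab:
                continue
            # Assumption: single-character pieces are never dropped,
            # so a word whose characters are all in the vocabulary
            # can always be tokenized.
            if end - start > 1 and rng.random() < p:
                continue  # drop this match and fall back to a shorter one
            match = piece
            break
        if match is None:
            return [unk]  # no usable piece at this position
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (hypothetical), including every character of the word.
vocab = {"un", "##aff", "##able",
         "u", "##n", "##a", "##f", "##b", "##l", "##e"}

# p=0.0 reduces to plain greedy maximum matching.
print(maxmatch_dropout("unaffable", vocab, p=0.0))
# ['un', '##aff', '##able']

# With 0 < p < 1 the segmentation varies from call to call.
print(maxmatch_dropout("unaffable", vocab, p=0.3, rng=random.Random(0)))
```

With p = 0 the sketch behaves like ordinary WordPiece-style longest matching, and with p = 1 it falls back to single characters under the assumption above. Values in between yield different segmentations of the same word across training epochs, which is the regularization effect the abstract refers to.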
