Paper Title
BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context
Paper Authors
Paper Abstract
Newly introduced deep learning architectures, namely BERT, XLNet, RoBERTa, and ALBERT, have proven to be robust on several NLP tasks. However, the datasets these architectures are trained on are fixed in terms of size and generalizability. To alleviate this issue, we apply one of the least expensive solutions to update these datasets. We call this approach BET, by which we analyze backtranslation data augmentation on transformer-based architectures. Using the Google Translate API with ten intermediary languages from ten different language families, we externally evaluate the results in the context of automatic paraphrase identification in a transformer-based framework. Our findings suggest that BET improves paraphrase identification performance on the Microsoft Research Paraphrase Corpus (MRPC) by more than 3% on both accuracy and F1 score. We also analyze the augmentation in the low-data regime with downsampled versions of MRPC, the Twitter Paraphrase Corpus (TPC), and Quora Question Pairs. In many low-data cases, we observe a switch from a failing model on the test set to reasonable performance. The results demonstrate that BET is a highly promising data augmentation technique: it pushes the current state of the art on existing datasets and bootstraps the use of deep learning architectures in the low-data regime of around a hundred samples.
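To make the described augmentation concrete, below is a minimal sketch of BET-style backtranslation, not the authors' code. The `translate` helper and the specific language codes are assumptions for illustration; the paper uses the Google Translate API with ten intermediary languages from ten language families, and any machine-translation client could be plugged in.

```python
# Minimal sketch of backtranslation data augmentation for paraphrase pairs.
# NOTE: `translate` is a hypothetical wrapper around an MT service (e.g. the
# Google Translate API mentioned in the abstract); replace with a real client.

def translate(text: str, source: str, target: str) -> str:
    """Hypothetical MT call: translate `text` from `source` to `target`."""
    raise NotImplementedError("plug in your translation client here")

# Illustrative set of intermediary languages; the paper's exact choice may differ.
INTERMEDIARY_LANGS = ["de", "fr", "es", "ru", "ar", "zh", "ja", "fi", "tr", "sw"]

def backtranslate(sentence: str, pivot: str, source: str = "en") -> str:
    """Round-trip a sentence through a pivot language to get a paraphrase-like variant."""
    intermediate = translate(sentence, source=source, target=pivot)
    return translate(intermediate, source=pivot, target=source)

def augment_pair(sent1: str, sent2: str, label: int):
    """Yield the original pair plus one backtranslated copy per intermediary language,
    keeping the original paraphrase/non-paraphrase label."""
    yield sent1, sent2, label
    for lang in INTERMEDIARY_LANGS:
        yield backtranslate(sent1, lang), backtranslate(sent2, lang), label
```

The augmented pairs would then be appended to the training split (e.g. MRPC, TPC, or Quora Question Pairs) before fine-tuning a transformer-based classifier; the test set is left untouched.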