Paper Title
Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification
Authors
Abstract
This paper tackles one of the greatest limitations in machine learning: data scarcity. Specifically, we explore whether high-accuracy classifiers can be built from small datasets by combining data augmentation techniques with machine learning algorithms. We experiment with Easy Data Augmentation (EDA) and backtranslation, together with three popular learning algorithms: logistic regression, support vector machine (SVM), and bidirectional long short-term memory network (Bi-LSTM). We use the Wikipedia Toxic Comments dataset so that, while exploring the benefits of data augmentation, we also develop a model that detects and classifies toxic speech in comments, to help fight cyberbullying and online harassment. Ultimately, we find that data augmentation techniques can significantly boost classifier performance and are an excellent strategy for combating the lack of data in NLP problems.
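To make the augmentation idea concrete, the following is a minimal sketch of two of the four EDA operations (random swap and random deletion) using only the Python standard library. It is not the authors' implementation: the function names, the `augment` helper, the parameter defaults, and the sample sentence are all hypothetical choices made for illustration. The other two EDA operations, synonym replacement and random insertion, need a thesaurus such as WordNet and are omitted here, as is backtranslation, which needs a translation model.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p.

    Always returns at least one word so the example never becomes empty.
    """
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence, n_aug=4, p_delete=0.1, n_swaps=1, seed=0):
    """Return n_aug perturbed copies of sentence (hypothetical helper).

    Alternates between random swap and random deletion. The defaults
    are illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    words = sentence.split()
    out = []
    for k in range(n_aug):
        if k % 2 == 0:
            new_words = random_swap(words, n_swaps, rng)
        else:
            new_words = random_deletion(words, p_delete, rng)
        out.append(" ".join(new_words))
    return out


if __name__ == "__main__":
    for variant in augment("you are a terrible person and should leave"):
        print(variant)
```

Each augmented sentence keeps the label of the original comment, so a small labeled set can be expanded several times over before training a classifier such as logistic regression, an SVM, or a Bi-LSTM.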