Title
Does Data Augmentation Improve Generalization in NLP?
Authors
Abstract
Neural models often exploit superficial features to achieve good performance, rather than deriving more general features. Overcoming this tendency is a central challenge in areas such as representation learning and ML fairness. Recent work has proposed using data augmentation, i.e., generating training examples where the superficial features fail, as a means of encouraging models to prefer the stronger features. We design a series of toy learning problems to test the hypothesis that data augmentation leads models to unlearn weaker heuristics, but not to learn stronger features in their place. We find partial support for this hypothesis: Data augmentation often hurts before it helps, and it is less effective when the preferred strong feature is much more difficult to extract than the competing weak feature.
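The abstract describes toy learning problems in which a reliable "strong" feature determines the label while an easier "weak" feature merely correlates with it, and data augmentation means adding counterexamples where the weak feature fails. A minimal sketch of that kind of setup (the specific features, token values, and function names here are our own illustration, not the paper's actual construction):

```python
import random

def make_example(label, weak_agrees, rng):
    """One toy sequence of 5 integer tokens.
    Strong feature: token 9 appears somewhere iff label == 1 (always reliable).
    Weak feature: the first token equals the label, except in augmented
    counterexamples, where it equals 1 - label."""
    seq = [rng.randint(2, 8) for _ in range(5)]  # filler tokens 2..8
    if label == 1:
        seq[rng.randrange(1, len(seq))] = 9      # plant the strong feature
    seq[0] = label if weak_agrees else 1 - label  # weak feature slot
    return seq

def make_dataset(n, augment_frac, seed=0):
    """augment_frac is the fraction of examples where the weak feature
    is made to fail; augment_frac = 0 gives the unaugmented training set."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        weak_agrees = rng.random() >= augment_frac
        data.append((make_example(label, weak_agrees, rng), label))
    return data
```

With `augment_frac = 0` the weak feature predicts the label perfectly, so a model can succeed without ever extracting the strong feature; raising `augment_frac` breaks that shortcut while leaving the strong feature intact, which is the intervention whose effects the paper studies.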