Paper Title
Exploiting non-i.i.d. data towards more robust machine learning algorithms
Paper Authors
Paper Abstract
In the field of machine learning there is growing interest in more robust and generalizable algorithms. This is important, for example, for bridging the gap between the environment in which the training data was collected and the environment where the algorithm is deployed. Machine learning algorithms have increasingly been shown to excel at finding patterns and correlations in data. Determining the consistency of these patterns, for example distinguishing causal correlations from nonsensical spurious relations, has proven to be much more difficult. In this paper a regularization scheme is introduced that prefers universal causal correlations. This approach is based on 1) the robustness of causal correlations and 2) the data not being independently and identically distributed (i.i.d.). The scheme is demonstrated on a classification task by clustering the (non-i.i.d.) training set into subpopulations. A non-i.i.d. regularization term is then introduced that penalizes weights that are not invariant across these clusters. The resulting algorithm favours correlations that are universal across the subpopulations and indeed obtains better performance on an out-of-distribution test set compared to a more conventional l_2-regularization.
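The core idea of the abstract, clustering the training set into subpopulations and penalizing weights that differ across clusters, can be illustrated with a minimal sketch. This is an assumed toy construction, not the authors' actual implementation: the "clustering" is a simple hard split on one feature, each cluster gets its own linear-classifier weights, and the invariance penalty pulls every cluster's weights toward their common mean.

```python
# Hypothetical sketch of a non-i.i.d. invariance regularizer
# (all names and details are assumptions, not the paper's exact method).
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data; the label depends only on feature 0.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
# Stand-in for a clustering step: split into two subpopulations on feature 2.
clusters = (X[:, 2] > 0).astype(int)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, clusters, lam, lr=0.1, steps=500):
    """One logistic-regression weight vector per cluster;
    lam penalizes the spread of the weights across clusters."""
    k = clusters.max() + 1
    W = np.zeros((k, X.shape[1]))
    for _ in range(steps):
        grad = np.zeros_like(W)
        for c in range(k):
            m = clusters == c
            p = sigmoid(X[m] @ W[c])
            grad[c] = X[m].T @ (p - y[m]) / m.sum()
        # Invariance penalty gradient: pull each cluster's weights
        # toward the mean over clusters.
        grad += 2.0 * lam * (W - W.mean(axis=0, keepdims=True))
        W -= lr * grad
    return W

W = fit(X, y, clusters, lam=1.0)
spread = np.abs(W[0] - W[1]).max()      # how non-invariant the weights are
preds = sigmoid((X * W[clusters]).sum(axis=1)) > 0.5
acc = (preds == (y > 0.5)).mean()
```

A larger `lam` drives the per-cluster weights closer together, so only correlations shared by both subpopulations survive; with `lam = 0` the clusters are fit independently and cluster-specific (potentially spurious) correlations are free to dominate.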